Create Cluster Definition File

In this part you’ll create a declarative definition of your cluster. A hefty chunk of the section is dedicated to understanding the GCE infrastructure conventions and limits that affect the performance of the NFS shared disk.

We are still assuming your current directory ./ is the root BurrMill directory.

Define your cluster

A BurrMill cluster is defined in a single YAML file. (YAML is in fact a superset of JSON: any valid JSON file is also a valid YAML file. This works one way only: it’s a very rich superset, and the extension goes far beyond formatting: YAML supports strict types and cyclic object graphs.) It’s ok if you never used or understood YAML: it’s just an indented text file mostly consisting of key: value pairs; some of them have a hyphen and space - in front, some don’t. Indentation in YAML is as important as it is in Python.

BurrMill comes with a pair of tools in its ./bin directory, so that they are always on your PATH: y2j and j2y, and your guess as to what they do is entirely correct. It’s always a good idea to validate a YAML file you’ve been working on with the y2j filename command (or y2j filename >/dev/null if the file is large). Both tools have a --help option.

Come up with a naming scheme

As we went through configuring SSH, you understand how pattern matching is used in its configuration. The name of the cluster is also a prefix to the name of every host in that cluster: if a cluster is named xc, its login node is named xc-login, and an ephemeral computing node may be named xc-node-p100-4. Besides the normal DNS rules for naming hosts (lowercase Latin and digits), we do not permit the hyphen - in the cluster name; you’ve already guessed why from the above examples. Personally, I use and recommend short names, since I have to type them to connect: any day I’ll prefer ssh xc-login to ssh cluster1uscentral1-login. Since I’m computing in the us location, in 3 regions, I use a naming scheme of two letters: the first is x, because it’s the least frequent word-starter in English (hint, hint: q is the second least frequent), followed by a second letter standing for the region where the cluster is located, reserving two more neighboring letters in case I need to deploy another cluster in the same region: a,b,c for us-central; d,e,f for us-east and u,w,y for us-west. Bad luck that c is too close to e, so the first set of three is not like the other two, but it is still easy to memorize. You can use any system (e.g., qc1, qc2 or qc3 for location-central), but it must be easy to match with SSH patterns and unlikely to spuriously match any other host you connect to over SSH, either directly or indirectly (as, for example, with a remote Git protocol). My pattern would be x?-*; for the second example, either q??-*, or maybe q?1-*,q?2-*,q?3-*, if your config file has a large number of destinations and you want to narrow the matching.

You already know where your first cluster will be (and it will be easy to change in the unlikely case your quota request comes back denied), so start with its name. Without loss of generality, let’s assume your cluster is named qe, because there is no 2-letter word starting with a “q”, and your preferred region is asia-east1, located in Taiwan (one of the signs that you’ve been hanging around GCP too much is that you can immediately tell the geographic location of a region given its codename; oops…), thus the “e” for “east.” Copy our template and name it the same as your cluster. BurrMill tools will get angry at you if the name of the file does not match the name declared in that file: this is to reduce your chance of messing up and deploying, or, worse, removing one cluster when you wanted to kill another.

$ cp lib/cluster/TEMPLATE.cluster.yaml etc/cluster/qe.cluster.yaml

then edit the file (it’s well commented), insert qe as the name, and set the correct zone.
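Before moving on, run the file through y2j, as suggested above: a clean conversion means the YAML is at least syntactically valid.

# A parse error would print a message and exit with a non-zero status.
$ y2j etc/cluster/qe.cluster.yaml >/dev/null && echo OK
OK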

Size the shared NFS disk and NFS server

The two most important factors that you must tune to your workload are the size of the shared disk and the size of the NFS server, the latter in terms of both the number of its vCPUs and (somewhat less important) its RAM.

For someone coming from the physical server world, this sounds… interesting. You would go after a faster disk, not the largest volume. You would assemble a striped array from multiple disks. And RAM available for file caching in a physical machine is much more important than the number of CPU cores: a file server CPU sits nearly idle, as most modern network and disk controllers do their job in hardware.

Surprisingly, in the GCE virtual world only the last statement, about a nearly-idle CPU, is true. Your physical hardware experience betrays you.

These guidelines are part calculation, part experience. The best advice I can give is to monitor performance and adjust the sizes as you go.

Factors affecting NFS server throughput

Feel free to skip this subchapter covering the theoretical aspects of throughput estimation, but bookmark it to return and read at leisure: this summary will help you fine-tune performance when it comes to that.

There is no NIC or SCSI host adapter

Naturally, there is no real disk connected to your VM. The data written to and read from a persistent disk travel over the network to a storage backend. But that data does not pass through the network interface: there is no real network interface either, of course; it is also a virtual contraption that the running OS thinks is an Ethernet card. The OS sees the disk as a normal SCSI disk device (/dev/sda, for example). GCE, however, counts the total traffic through the disk interface and the network interface together. The picture differs, albeit slightly, between pd-standard (magnetic) and pd-ssd disks; the former has a single additional constraint that the latter does not. Let’s first see how the various moving parts interact when viewed from the outside of the GCE machinery: we cannot have, and do not need, any internal knowledge of what is really going on under this abstraction.

A laundry list follows; our delicate optimization dance is about balancing these parameters. All numbers refer to pd-standard, magnetic storage. (It may or may not be in fact magnetic. In any case, this type seems optimal for our use pattern: we read and write large files. SSD shines on small random accesses but, dollar for dollar, has close throughput figures and 4 times less capacity.) Do not confuse MB/s or GB/s, mega- or gigabytes per second, the unit in which throughput (the rate of actual bytes read or written) is measured, with Mbps or Gbps, bits per second, which measures the total network bandwidth. In an ideal world, the former would be 1/8 of the latter; due to protocol overhead, the throughput never reaches this potential limit. We’ll be using B/s and bps, but never Bps or b/s, as the abbreviations for these dimensional measures.

  1. Disk total throughput: linear, 120MB/s per 1TB of disk size for pd-standard, but capped at
    1. 1200MB/s for read, corresponding to a 10TB disk; this number is also the total maximum throughput achievable.
    2. 400MB/s for write, afforded by a 3.4TB disk. Thus, 4TB or slightly more (to account for the fact that not all operations are writes) appears to be the largest practical NFS pd-standard disk size, if the goal is performance, not data volume. (You’ll hardly fill this disk even in multiple large experiments, and it’s a good practice to offload archived data to the cheap GS buckets.)
  2. Simultaneous reads and writes: the rule of thumb is that 1MB/s of write throughput takes away 3MB/s of read throughput. Thus, a single disk (not limited by size) can perform at 1200MB/s on sole reads, or 400MB/s on sole writes. This is essentially the same limit as 1(a) vs 1(b) above: you can think of 1200MB/s as the total bandwidth for reads and writes, with the tripling requirement for writes. Thus, for example, a disk can sustain 200MB/s of data writes and 600MB/s of data reads, for the total of 3×200+1×600=1200MB/s, or 300MB/s data write and 300MB/s of data read.
  3. VM egress (outgoing bandwidth) cap. 6 months ago, this cap was linear in the number of vCPU, 2Gbps per 1 vCPU. Later, Google increased the egress to flat 10Gbps for vCPU count from 2 to 4; for the rest of the range (5+ vCPUs), the linearity still works. However, the linearity stops when the VM reaches the hard saturation limit, which depends on the machine platform, and is either 16Gbps (corresponding to 8 vCPU) or 32Gbps (16 vCPU):
    • N1 machines with a Skylake or newer CPU cap out at 32Gbps, reached at 16 vCPU. (N1 machines support setting a lower bound on the CPU platform selection.) N1 Skylake is now available worldwide, and is a good choice for the largest and fastest NFS server.
    • N1 machines with older CPUs (Haswell, Broadwell…) and the low-cost E2 machines are capped at 16Gbps. E2 appears to be the most economical choice for a smaller NFS server.
    • All N2, C2 and N2D machines cap out at 32Gbps at 16 vCPU. Currently, each of the three types requires its own separate quota, only for non-preemptible use, and offers no benefit when used in the NFS server.
  4. Write bandwidth, which is bounded by the network egress capacity (strictly speaking, assuming no other IP egress, which is never quite the case for a file server!). Google recommends adding 10% for protocol overhead. Since all disk writes are tripled because of redundancy encoding, and accounting for the overhead, the maximum achievable write bandwidth is 1024/(3.3×8)≈39MB/s per 1Gbps of egress limit, or 76MB/s per vCPU for counts of 5 and above, up to the platform maximum described above: 600MB/s for an 8-vCPU E2 machine, or 1200MB/s for a 16-vCPU N1 Skylake machine; 1200MB/s is thus the absolute maximum achievable write bandwidth (see the worked sketch right after this list). Thus, even an 8-vCPU machine is capable of saturating the 400MB/s write limit of a pd-standard disk. However…
  5. …remember that for the NFS server, the total egress must be split between disk writes and serving data to clients. When the VM’s egress bandwidth is saturated, the 60/40% rule applies: 60% of the bandwidth is dedicated to disk writes, and 40% to network traffic. This is the reason to increase the vCPU count of the NFS server beyond disk write saturation. In fact, the whole available RAM of the machine is used as the disk cache, and many recently accessed files may be served from memory. We also configure the mount options for the shared NFS disk such that writes are lazy, and ext4 journaling is set to non-ordered for performance. (This means that a recovery from a machine crash does not guarantee that the disk will be recovered as it was at a consistent time point: some later writes may be committed while earlier writes are rolled back, even within areas of the same file. Fortunately, we have never observed a single kernel panic event: a single-role NFS server VM is very stable.)
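To make these numbers concrete, here is the promised back-of-the-envelope sketch (plain shell arithmetic, integers only) for an assumed 2TB pd-standard disk behind an 8-vCPU server with a 16Gbps egress cap:

$ size_tb=2 egress_gbps=16
$ disk_read=$(( size_tb*120 > 1200 ? 1200 : size_tb*120 ))    # item 1: 120MB/s per TB, read capped at 1200
$ disk_write=$(( size_tb*120 > 400 ? 400 : size_tb*120 ))     # item 1: write capped at 400
$ egress_write=$(( egress_gbps*39 ))                          # item 4: ~39MB/s of writes per 1Gbps of egress
$ echo "disk read <= ${disk_read}MB/s, disk write <= ${disk_write}MB/s, egress-limited write <= ${egress_write}MB/s"
disk read <= 240MB/s, disk write <= 240MB/s, egress-limited write <= 624MB/s

Here the disk itself, not the server’s egress, is the bottleneck; and per item 5, the 624MB/s egress budget would be further shared between disk writes and serving data to the NFS clients.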

Is there a takeaway from these heaped details? Well, you should think about the pattern of shared disk use in Kaldi. First, there are operations, like feature generation, which may be more CPU-intensive for large computations in production models, where a lot of DSP is used to augment data, or more disk-bound if you just extract low-res, unaugmented features. Then, speaking of neural networks, we usually bootstrap models the traditional way, with a sequence of more and more advanced GMM models.

The common pattern for both tasks is: the larger the model, the more CPU-intensive it is. These parallelize quite well, and still the effect is noticeable. The bootstrap phase tends to compress the range of potential NFS utilization: small models tend to be relatively more disk-intensive, while large models lean toward heavier CPU and lighter disk use. This does not mean there can be a one-size-fits-all NFS server, but the demand for its services scales slower than the overall training job size. Also, forced alignment, being essentially a decode, does not have this compressing nature.

The second use pattern is the data preparation for DNN training. This job is so hopelessly disk-bound that the NFS throughput is practically the only factor that determines its runtime. But do not rush to build the largest server possible: even with a few mistakes (too few alignment jobs in the previous steps led to a low degree of parallelism in cegs mixing), I clocked about 40 minutes to produce 600GB of cegs archives, using just a 6-vCPU, 2TB NFS server. Also, since this job is so disproportionate in its throughput use, we are working on optimizing it for the cloud specifically: GCE offers nodes with 3TB of RAM and even beyond. These 160-vCPU machines are far from cheap, but performing the shuffle even for a huge model in 5-10 minutes using a RAM disk is more than affordable (and these also come as 70%-discounted preemptible nodes).

The DNN training stage is the longest, and its disk use is relatively low: every 1-3 minute cycle, a team of up to 20 GPU nodes produces a short burst of disk activity, then slowly streams cegs files through the computation. This is the main process where you do not want to undersize your shared NFS storage: the GPUs are quite expensive, and you want to keep them as busy as possible.

Lastly, there is the decode you perform after training. This is another mostly disk-bound operation, at least speaking of practical work (when you do worry about realtime performance and target a 3-10× realtime decode).

If you allow me a digression from which you may or may not extract a practical tidbit: I also decode during training. The dynamics of the validation set vs. the test set in mismatched domains (the whole world seems to chiefly use cell phones, while datasets of transcribed cell-phone speech are seriously wanting; tricky augmentation, and constructing and tuning the model for good extrapolation, was my only viable option) is, in my practice, the main indicator of how good your augmentations and extrapolation are: if validation continues to improve, but the decode WER dips and then grows, that is a sure indication to stop training and go back to the drawing board. I often start with a dirty, fast-converging training on 1/10 of the training data and a shallower net, with the goal of overfitting the model, and compare where the WER dips on the in-domain and out-of-domain sets. When the lines dip at the same epoch, that’s a sign you hit the augmentation bull’s eye. This reinforces my point that data science has nothing to do with science, and should have been called what it is, data witchcraft.

So, how does it all translate into practice?

Practical NFS options

For those who skipped the chapter: you oversize the disk for performance, and the machine vCPU count for its increased network and disk throughput. All vCPUs will be close to idle all the time, and the disk will be vastly empty.

First, a note on managing costs. A provisioned pd-standard disk costs around $40 per TB×month, use it or not. This is a barely noticeable charge if you experiment daily; if you put your work on hold, snapshot the disk and delete the cluster. (You should also snapshot the disk once in a while for backup purposes.) If you look at the pricing table, a snapshot seems not much cheaper than the disk, at $26 per TB×month, but this is a misreading: these are different TB. A provisioned disk is charged for its whole space, so you pay $80/month for a 2TB disk for all the time it exists, even if it’s empty. A snapshot, on the other hand, stores only the used space. Since we mount disks with the discard option, freed blocks are reported as free with a hardware command (the option was designed for SSD disks), so GCE knows which blocks are in use and includes only them in the snapshot. Since your disk is always oversized, a 2TB disk may turn into a 200GB snapshot, if you clean out the egs directories before putting it into long-term storage.
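For reference, a backup snapshot is a single command (the disk and snapshot names here are hypothetical; use whatever names your deployment assigned):

$ gcloud compute disks snapshot qe-nfs --zone=asia-east1-a \
    --snapshot-names=qe-nfs-2020-06-01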

Second, the machine type of any VM can be freely changed while the machine is powered off (TERMINATED, in GCE-speak). You can play with the size of the machine, or change it depending on your current job. A disk, on the other hand, may only be extended, never shrunk. (A sketch of the corresponding gcloud commands follows the two points below.)

Point 1: If in doubt, start with a smaller disk; it’s easy to extend, but hard to contract.

Point 2: Do not try to save money by undersizing the disk; you’ll lose more.
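For reference, both operations are single gcloud commands (the instance and disk names are hypothetical):

# Change the NFS server shape while it is TERMINATED.
$ gcloud compute instances set-machine-type qe-filer --zone=asia-east1-a \
    --machine-type=e2-highmem-8
# Grow the shared disk; growth only, and the filesystem must be grown
# afterwards as well (e.g. resize2fs for ext4).
$ gcloud compute disks resize qe-nfs --zone=asia-east1-a --size=2TB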

There is also an interesting technique that I call a sidecar disk, well documented by the very same guys who rent you the disk storage. In short, if you connect multiple disks to a VM, the total throughput is determined by their total capacity, and is enforced on the total disk read/write traffic, not per disk. If you connect a 250GB disk and a sidecar 1750GB disk to a single machine but use only the small one, it gets read/write performance equal to that of a single 2000GB disk. The sidecar disk does not have to be mounted or even formatted: it just needs to exist and be attached to the VM. And creating and deleting a disk takes 5-15 seconds… You get the point: one cost-saving measure is creating the sidecar disk only for the duration of a large computation, to temporarily beef up the NFS server. We will incorporate the technique in future releases.

Point 1a: Use a sidecar disk, even if running your cluster constantly, to assess whether extending the main disk would be beneficial. A cheat sheet of gcloud commands will be provided as an appendix to this tutorial.
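Meanwhile, the whole sidecar round trip is just a handful of commands (all names are hypothetical):

# Temporarily raise a 250GB main disk to 2000GB-class throughput.
$ gcloud compute disks create qe-sidecar --zone=asia-east1-a \
    --type=pd-standard --size=1750GB
$ gcloud compute instances attach-disk qe-filer --zone=asia-east1-a \
    --disk=qe-sidecar
# ... run the large computation ...
$ gcloud compute instances detach-disk qe-filer --zone=asia-east1-a \
    --disk=qe-sidecar
$ gcloud compute disks delete qe-sidecar --zone=asia-east1-a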

As for practical choices, think of clusters on a T-shirt size scale: S, M, L and XL. Size S is only for learning, toying with and monkeying around: five T4 GPUs max. Prices are for the machine only, charged only while it is powered on. “Day” is a 24-hour interval for reference only: the bill is pro-rated per second of use.

  • Size S: Machine type e2-highmem-4 ($4.50/day), disk size 750GB.
  • Size M: Machine type e2-highmem-8 ($9/day), disk size 1250GB.
  • Size L: Machine type e2-highmem-8 or n1-custom-12-79872 ($13/day), disk size 2TB.
  • Size XL: Machine type n1-highmem-32 ($23/day, but you don’t care), disk size 4TB. Custom extended memory may be helpful.

For learning Kaldi, start with size S. For any sensible work, start with M, and upgrade to L if you see too much throttling. If your shoulder width calls for size XL, let me know what it is you are computing: I have never needed this much, and am genuinely curious.

And, since you still have that cluster definition file open in the editor, edit in the machine type and the disk size.

Sizing the login node

In an ideal world, the login node would only send commands to Slurm for computations, and would need the bare minimum of CPU and RAM. But Kaldi scripts sometimes do work on the node they run on. This includes, for example, composing the L and G machines, so there is an occasional computational demand. e2-standard-2 is a decent choice.

Make sure that the scripts do not compose the full HCLG graph on the login node; use the $cmd wrapper to allocate a whole compute node for this work, as these steps use a lot of memory. If you cannot avoid it, switch to e2-standard-4 or e2-highmem-4, which come with more RAM.
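One hedged way to do that, using the usual Kaldi wrapper convention (the paths and the --mem value are illustrative; the exact resource options depend on your slurm.pl.conf):

# Inside a recipe, with cmd.sh sourced: compose HCLG on a compute node
# instead of the login node.
$cmd --mem 12G exp/tri3/graph/mkgraph.log \
  utils/mkgraph.sh data/lang_test exp/tri3 exp/tri3/graph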

Since e2-standard-2 is already the default, we are done. That was quick.

The control node

The control node runs the cluster control software. BurrMill uses Slurm, software originally designed to run supercomputing clusters at LLNL, and now owned and developed by SchedMD (it’s open source and GPL; the company lives off its support contracts). We are not using Slurm in the traditional HPC way.

In a supercomputer, it is common to run jobs on multiple communicating nodes for days or weeks on end. Jobs are scheduled in advance, and may wait in the queue for weeks in a high-load environment. Some jobs consist of multiple steps, and steps may have non-trivial dependencies. Networking is complicated; usually there are separate data and control networks, and the topologies must be accounted for on multiple levels, from the job’s tasks being close together on the network to placing tasks on NUMA nodes within the CPU package and across multiple CPUs sharing a board so that they are as close as possible to their RAM. (And you thought the NP-hard knapsack problem was hard…) Security and accounting are very strict, since a single datacenter-sized cluster is shared by multiple users and institutions.

Our requirements for a cluster controller are almost directly opposite to these. Jobs are short, non-communicating and independent. Load changes very quickly, and nodes come and go on demand. Networking is flat (GCE hides the real complexity from us), and we do not use security or accounting, as this is also well handled by GCE. The mighty scheduling capabilities of Slurm are barely used. Still, Slurm handles this heels-over-head scenario with remarkable precision. In fact, SchedMD added a few cloud-specific capabilities to Slurm v18 and expanded them in v19. We apply a patch a few lines long to the Slurm source to reduce certain hardcoded timeouts; everything else works out of the box. You do not need many details about Slurm; read the documentation if you want to understand it deeper. The SchedMD site contains all of it. You may use sinfo, squeue, scontrol and scancel once in a while; do not miss the tools/bin/swatch script, which displays all the interesting info dynamically using the watch command. Kaldi uses sbatch indirectly via cmd=slurm.pl. We install the Slurm man pages, so you may refer to these.
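A quick cheat line-up of the commands you’re most likely to type (all standard Slurm, except swatch, which ships with BurrMill; the node name and job id are made up):

$ sinfo                               # partitions and node states at a glance
$ squeue -u "$USER"                   # your pending and running jobs
$ scontrol show node qe-node-p100-4   # details of one node
$ scancel 12345                       # cancel a job by its numeric id
$ swatch                              # BurrMill's live view built on watch(1)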

The slurm.pl and an example configuration file are currently located in ./tools/kaldi/. We’ll integrate them into the master branch of Kaldi, but for now, while BurrMill is in beta, you’ll need to copy them into your experiment directory from there.

The control node needs no configuration: one size fits the whole range of BurrMill clusters.

Compute nodes

Remember that a GCE vCPU is a hyperthread, or ½ of a core. GCE guarantees that 2 consecutively numbered vCPUs belong to the same physical CPU core. When counting resources, we use vCPU to refer to GCE’s resource, but CPU to refer to a physical CPU core. Slurm is aware of this through its topology discovery, and schedules jobs per real CPU. CPU count = ½ vCPU count.

This part of the cluster definition file is dedicated to compute node definitions, under the nodes: key. Currently, you need two different kinds of nodes: CPU-only and GPU nodes. Since GCE added the whole zoo of new CPU types only recently, I’ve been using a single custom-shaped compute node type with 10 vCPU (5 CPU cores) and only 12.5GB of RAM. Large-RAM tasks (making the HCLG graph) requested nodes exclusively; most other tasks comfortably fit 5 per node: Kaldi is very mild on memory use. Also, there are up to 5 simultaneous non-GPU jobs launched by train.py between iterations of the GPU training, which keeps a single, optimally used CPU node up for the whole duration of training.

The same is currently doable with the second generation N2 machines, but the C2 platform does not allow custom configurations.

These days, the compute platforms of choice are the C2 and N2. We have not yet fully evaluated them, but they do show less runtime on the same tests than the N1 Skylake. The N2 supports custom configuration just like the N1. Unfortunately, the high-performance C2 machine types do not come in custom shapes, only in a small set of predefined standard configurations. It seems (in theory, based on node shape only, and assuming that performance is not affected by size; we need to experiment with real Kaldi computations as benchmarks, because all C2 machines are only partial physical CPUs, and larger nodes may exhibit better performance due to better cache coherency) that the c2-standard-4 (4 vCPU = 2 CPU, 16GB) is the optimal choice: it can be allocated whole to RAM-demanding tasks such as constructing the HCLG graph, and the 5 non-GPU tasks will keep 3 nodes alive during the GPU phase, with only half a node wasted.

In Slurm, nodes may belong to partitions, which are somewhat like queues in other cluster controllers, such as SGE. The relation is optional, and is, generally, many-to-many. Partitions are very handy for identical nodes, as we do not care which node runs a given task, only that the node is of a specific configuration. Since we use only 2 node types, we allocate each type into its own partition. On a batch submission, slurm.pl specifies a partition. The default cluster template defines a “standard” partition called std. There are also 3 predefined node types: our old 5-CPU, 15GB N1 Skylake-based node; the same configuration based on the newer N2 platform; and the more atomized 2-CPU, 16GB C2. If you optimize for cost, go with N2; if for time, then C2 is the better choice. Note that experiments should be configured according to the node shape. Read the comments in tools/kaldi/example.slurm.pl.conf to understand the implications for the get_egs stage; multithreaded i-vector extractor training should also be planned differently.

Seasoned hint: Sometimes GCE runs out of resources. There were days when I was unable to compute: the preemptible nodes could not start, or were killed in droves. This usually resolves in a few hours. Define two CPU-only partitions, the preferred one and another as the “plan B,” and keep two variants of the slurm.pl.conf file. It takes one change in cmd.sh to direct CPU jobs into a different partition in case the preferred platform type is in short supply. You can also use the GRES technique, used for the GPUs, described below.

The second partition is for the GPUs. Usually, I keep all GPU nodes (T4, P100 and V100, although I do not use the latter) in one single GPU partition. Slurm has a concept of a generic resource, or GRES in Slurm-speak. A node defines what types of GRES it provides, and how many of each, and a job declares what types and quantities of GRES it needs. Slurm makes sure that the job’s GRES requirements are satisfied. The full form of a GRES is name:type:count; in a job’s request, short forms are also acceptable: name:count, name:type and name. There is no ambiguity, as the resource name and type must start with a letter, and the count is a positive integer. The type is optional in a node definition if the resource name does not require further qualification. A type omitted in a request stands for any type, and an omitted count defaults to 1.

We call our GPU resources by the name cuda, not gpu. This is done because not all generic resources are equally generic. When a resource is declared, Slurm tries to find a plugin with the matching name, and it comes with the gres_gpu.so plugin. While beneficial for sharing physical multi-GPU nodes, the plugin only complicates things for us: it requires the /dev/nvidiaN devices to be listed in an additional file gres.conf, and sets up the environment of the job with CUDA_VISIBLE_DEVICES pointing to the allocated GPUs. (This is a massive oversimplification of what actually happens. Slurm runs jobs in cgroups, like Docker does, so that other /dev/nvidiaN devices simply do not exist for the job. Also, GPU scheduling is much more complex and accounts for internal machine topology. Ignore this; we do not even use the gpu GRES. I’m only disclosing that this time the picture is dumbed down, due to the tremendous complexity of Slurm scheduling.)

Absent a plugin, GRES are simple identifiers of something countable (Slurm does not care what), which a node provides and a job wants; no altering of the job environment or special setup is done. We use cuda:p100, cuda:t4 etc. to specialize this resource, and specify the type in the configuration read by slurm.pl. If you use only one type of GPU in the cluster, you do not need the GRES specification (the partition is enough), but if you do use GRES, always request a specific type; otherwise you risk Slurm dispensing a mix of different GPU types, such that the expensive fast ones will have to wait until the slower ones complete their shards.
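For illustration only (slurm.pl normally composes the submission for you from its config), a direct sbatch request for such a resource would look like this; the script name is hypothetical:

# Ask for one P100 on a node in the gpu partition; when the cluster mixes GPU
# types, always name the type, never request bare "cuda".
$ sbatch -p gpu --gres=cuda:p100:1 my_gpu_step.sh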

A note on the comment-like // prefixes: these are not YAML comments. We process them specially, so you can “comment out” a whole node template by placing this prefix in front of a single line. YAML comments, as in shell or Python, start with # and continue to the end of line, so if our mechanism did not exist, you’d have to comment out the whole block, which is error-prone. To YAML, the // is just part of the key in a key-value pair. If you are familiar with JSON (familiar as in having seen it at least once), look at what happens when this YAML is converted to JSON:

$ echo "
> foo:
>   answer: 42
> //bar:
>   question:
> " | y2j
{
  "foo": {
    "answer": 42
  },
  "//bar": {
    "question": null
  }
}

//bar is just a normal key. BurrMill tools know about them, and do not output the corresponding node configuration.

This is the only place in the cluster template where the //-construct is recognized.

A node template looks like this:

p100:
  Count: 24
  CoresPerSocket: 2
  RealMemory: 5960
  Gres: cuda:p100:1
  GCE:
    machine-type: n1-custom-4-6144
    accelerator: type=nvidia-tesla-p100,count=1
  • Count: is required. Define as many nodes as you asked for in the quota request. All nodes are started on demand only, so you pay nothing for them until they are created.
  • CoresPerSocket: the number of CPUs in a node (vCPU count divided by 2).
  • RealMemory: In MB. Somewhat less than the total memory; subtract 200MB to be in the ballpark. slurmd -C prints the exact value and exits when run as an interactive command on the compute node (see the example after this list); subtract 10-20MB from the reported value to allow a bit of slack, in case the next version of the image reports less available memory.
  • Gres: Just discussed. Make sure that the GPU type and count are true for the cuda GRES, and that no_consume is set on the gcecpu GRES (Slurm counts CPUs, of course, so you do not want to interfere).
  • Other keys: Any value available in slurm.conf for node specification can be added. This is more like a future extension, as currently there is only the Weight parameter that makes sense to set. This is an advanced feature.
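To find the RealMemory value mentioned above, run slurmd in its report-and-exit mode on a live compute node of the same shape (output abridged, numbers illustrative):

$ slurmd -C
NodeName=qe-node-p100-4 CPUs=4 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=5978 ...
UpTime=...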

Under the GCE subkey, any switch accepted by gcloud compute instances create is in principle recognized, but only the three used in the file probably make sense:

  • machine-type: required. E.g. n1-custom-4-6144 for a custom N1 with 4 vCPU and 6144MB of RAM; the RAM must be divisible by 256MB. Or, e.g., c2-standard-4 for a machine of a predefined configuration.
    • Only the N1 and N2 support custom configuration.
    • Only the N1 can be used on GPU nodes.
    • Custom machine types with irregular shape are said to be less prone to preemption, according to the GCE support’s lore. I do not have enough data to confirm or deny this.
  • min-cpu-platform: Intel Skylake
    • Use only for the N1 node type. The generation 2 machines (N2, C2) are based on Cascade Lake cores, already beyond Skylake.
    • Required for N1-based CPU compute nodes: they greatly benefit from AVX-512 instruction set available starting with the Skylake platform.
    • Do not set for GPU nodes, which have to be N1.
  • accelerator: type=nvidia-tesla-TYPE,count=1: Set for GPU nodes (this is what makes them GPU nodes, in the end). TYPE is one of p100, t4, v100. Remember that the V100 does not offer a good cost/benefit ratio, although it is ≈25% faster.

A lot more switches are added in ./lib/imaging/layouts/compute/usr/local/sbin/slurm_resume.sh.

Hint: to verify that the machine-type name is well-formed and available in zone, use the gcloud command like:

# c2-standard-4 is a correct name and available in us-central1-a
$ gcloud compute machine-types describe --zone=us-central1-a c2-standard-4
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: 'Compute Optimized: 4 vCPUs, 16 GB RAM'
. . .

# c2-standard-4 is unavailable in us-west1-a
$ gcloud compute machine-types describe --zone=us-west1-a c2-standard-4
ERROR: (gcloud.compute.machine-types.describe) Could not fetch resource:
The resource 'projects/YOUR_PROJECT/zones/us-west1-a/machineTypes/c2-standard-4' was not found

# n1-custom-4-6144 is a valid configuration
$ gcloud compute machine-types describe --zone=us-central1-a n1-custom-4-6144
description: Custom created machine type.
guestCpus: 4
. . .

# n1-custom-4-6000 is an invalid configuration
$ gcloud compute machine-types describe --zone=us-central1-a n1-custom-4-6000
ERROR: (gcloud.compute.machine-types.describe) Could not fetch resource:
Invalid resource usage: 'Memory should be a multiple of 256MiB, while 6000MiB is requested'.

# C2 does not support custom configurations.
$ gcloud compute machine-types describe --zone=us-central1-a c2-custom-4-6144
ERROR: (gcloud.compute.machine-types.describe) Could not fetch resource:
The resource 'projects/YOUR_PROJECT/zones/us-central1-a/machineTypes/c2-custom-4-6144' was not found

A similar command to confirm that a GPU type is available:

# T4 is available in us-central1-a.
$ gcloud compute accelerator-types describe --zone=us-central1-a nvidia-tesla-t4
creationTimestamp: '1969-12-31T16:00:00.000-08:00'
description: NVIDIA Tesla T4
. . .

# P100 is unavailable in us-central1-a.
$ gcloud compute accelerator-types describe --zone=us-central1-a nvidia-tesla-p100
ERROR: (gcloud.compute.accelerator-types.describe) Could not fetch resource:
The resource 'projects/YOUR_PROJECT/zones/us-central1-a/acceleratorTypes/nvidia-tesla-p100' was not found

# Z987 is not a valid GPU type. The error message is same.
$ gcloud compute accelerator-types describe --zone=us-central1-a nvidia-tesla-z987
ERROR: (gcloud.compute.accelerator-types.describe) Could not fetch resource:
The resource 'projects/YOUR_PROJECT/zones/us-central1-a/acceleratorTypes/nvidia-tesla-z987' was not found

The last section, partitions: defines Slurm partitions:

  • std is for CPU nodes;
  • gpu is for GPU nodes.

If the key Nodes: is missing in a partition named X, then Nodes: [ X ] is assumed, i.e. the partition includes only the nodes of the template with the same name. The defaults are good, and should not be changed for your first cluster. Note: it is not an error if the Nodes: array includes a configuration which is “commented out” with the //, so you can easily try things without messing with the file too much. It is an error, however, if a configuration found in the array is not defined at all. Also, all node template names must be unique, even if “commented out.”
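To illustrate the Nodes: shorthand, here is a hypothetical fragment (your template already contains the real thing), piped through y2j in the same way as the // example above:

$ echo "
> partitions:
>   std:                    # Nodes: [ std ] is assumed
>   gpu:
>     Nodes: [ p100, t4 ]
> " | y2j
{
  "partitions": {
    "std": null,
    "gpu": {
      "Nodes": [
        "p100",
        "t4"
      ]
    }
  }
}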

Run the YAML file through y2j to make sure it is syntactically correct.

Commit your cluster definition

This is an important step. The bm-deploy tool will refuse to write uncommitted Slurm configuration (you can override this, but better don’t). Do not push yet, though: you can easily amend your last commit, and Git will assign a different SHA1 hash to it. The typical recipe for defining the cluster configuration looks like this:

  1. Write the cluster definition file.
  2. Commit your changes.
  3. Deploy cluster.
  4. Verify that Slurm is happy, and that it can start and talk to compute nodes.
  5. If everything looks good, do git push. You are done. Consider Git tags to visibly mark commits.
  6. If anything didn’t go as desired, power down the cluster.
  7. Edit the definition file.
  8. Amend your latest commit (git commit -a --amend --no-edit).
  9. Reconfigure Slurm only (bm-deploy config -s).
  10. Power on the cluster, and return to step 4.

We’re done with (1) now. Let’s do (2). We’re still using the name qe as your cluster name.

# Tip: you can make --short the default if you prefer compact statuses.
# Alternatively, (I do) define a Git alias 's' for 'status --short'
# It's a good idea to make sure that the cluster file is the only change:
# keep your commits small and targeted.
$ git status --short   # I type 'git s' for this!
## master...origin/master
?? etc/cluster/qe.cluster.yaml
$ git add -v etc       # -v shows added files.
add 'etc/cluster/qe.cluster.yaml'
$ git commit -m "Add definition for cluster 'qe'"
[master 0cc405d] Add definition for cluster 'qe'
1 file changed, 139 insertions(+)
create mode 100644 etc/cluster/qe.cluster.yaml

We’re getting very close to that “press the start button” moment. In fact, we are already there… The last section guides you hands-on through the deployment and validation of your cluster.
