Set up BurrMill Infrastructure

This is the fourth part of the BurrMill 101 crash course. You will learn what a resource quota is and how to request it, build all the software, and create a base image with all necessary library and tool packages preinstalled, so that a new instance can be created in under a minute. There is still a lot to grok in a short time, but this knowledge will pay off sooner than you think. Still, we offer a few shortcuts of the do-now-understand-later kind, in case you feel exhausted.

A fully configured BurrMill GCP project consists of a multitude of supporting services, and one or more computing clusters. The supporting infrastructure is required for building code, rolling out OS images that allow computing nodes to be created and destroyed in a matter of seconds, storage for configuration and software components, and optional private Git repositories. Within this infrastructure, you deploy computational clusters, in which you perform experiments.

Did you complete the homework, by the way? It’s due right now!

You will complete the following tasks in this walkthrough, and get some superficial, yet not incorrect, understanding of the GCP:

  • Create the project.
  • Check out the BurrMill tools from their Git repository.
  • Run the initial BurrMill setup script.
  • Calculate required resources for quota.
  • Request computing resource quota via the Cloud Console.
  • Modify BurrMill control files as necessary.
  • Build all software and disk images.
  • Check your infrastructure configuration into a Git repository.
  • Explore a few useful GCP commands.

1. Create the project

All command-line commands, be they gcloud, gsutil, git or BurrMill tools, are exactly the same whether you work from the Cloud Shell or your own home or office Linux machine. Cloud Shell has all the necessary BurrMill prerequisite tools; if you use your own machine, we check that it has them before configuring your project.

First of all, you need a GCP project. And remember, we also want to get rid of that automatically created project that everyone gets, want it or not, when signing up. One command line for each, no biggie (put your chosen project ID in place of the PROJECT_NAME placeholder; unlike the display name, the ID must be globally unique). --name is a human-readable string, displayed in the Cloud Console Web UI. You can select any readable name you want, with spaces and punctuation, in any language; this is not an identifier.

$ gcloud projects list
 PROJECT_ID           NAME                   PROJECT_NUMBER
 upbeat-bolt-271409   My First Project       1176168941010
$ gcloud -q projects delete upbeat-bolt-271409  # Copy the project ID from above.
 ... snip lots of babble ...
$ gcloud alpha projects create --name='BurrMill' --no-enable-cloud-apis --set-as-default PROJECT_NAME
 ... If you get an error saying that the ID is already taken, get creative once more and choose a different PROJECT_NAME ...

Everything in GCP except GS is done using the gcloud command; GS is controlled with a separate gsutil command. Bash tab-completion should work with both; if it does not, you are probably missing a Bash tab-completion package (bash-completion in Debian/Ubuntu).
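
If completion does not work on your own machine, installing that package is usually all it takes (the command below assumes a Debian or Ubuntu system with sudo rights; Cloud Shell already has it):

$ sudo apt-get install bash-completion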

Learn a few things about the gcloud command:

  • General form of the command is, not very formally, “gcloud [subsystem] noun verb [arg] switches”. Global switches (like the -q we used above to suppress a confirmation) can go anywhere, but command-specific switches go anywhere after the verb part.
  • The noun represents a GCP resource (a project, VM, disk, identity account, storage bucket are all examples of resources). A majority of nouns support CRUD (create, read, update, delete) semantics. The “read” part consists of two verbs: list to list resources roughly one per line, and describe to print detailed information about a single resource. create, update and delete are corresponding verbs for the rest.
  • A resource operation is a GCP API call, where the optional subsystem and noun define the API subsets, and the verb the API call: gcloud compute disks delete tempdisk translates (conceptually) into an API call compute_api.disks.delete("tempdisk"). A couple of harmless read-only examples of this pattern follow right after this list.
  • Special forms gcloud alpha and gcloud beta invoke, correspondingly, the alpha and beta versions of the API named by the noun part. Some alpha APIs are public, others are invitation-only. Beta APIs are all public. The versions that are neither beta nor alpha are termed GA, for “generally available.” In the last command above, we used the alpha version, because it has the --no-enable-cloud-apis argument, not present in the GA version.
  • Every command has a help page:
    • gcloud subsystem --help.
    • gcloud [subsystem] noun --help.
    • gcloud [subsystem] noun verb --help.
    • gcloud --help or gcloud help.
    • gcloud info for the SDK information.
    • gcloud topic for additional man chapters.
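
To make the noun/verb pattern concrete, here are two read-only commands that are safe to try right now (PROJECT_NAME is the project ID you chose above; they only print information and change nothing):

$ gcloud projects list                   # the 'list' verb: one resource per line
$ gcloud projects describe PROJECT_NAME  # the 'describe' verb: details of one resource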

We highly discourage creating resources that are normally part of the BurrMill infrastructure or clusters with a manual gcloud X create Y command. The names of resources may be special, or they may be labeled in a way expected by other tools, or even be managed by the GCP Deployment Manager, which is quite picky about prying into its business, and has a short temper. Deleting or updating resources has far fewer unintended consequences.

2. Check out the BurrMill repository

This is a simple part, if you know a bit of Git.

The remote branch that you clone is read-only, but there is one directory, ./etc/ (from now on, we’ll use the dot . to indicate the BurrMill root), which is entirely yours. Some files may be added to it between releases, but we never overwrite or modify your files. You’ll tune a thing or two, and check in your settings. We’ll do that later. Currently, we only need to prepare the project.

# In a directory that will contain BurrMill root (your home directory
# is fine, especially in the Cloud Shell environment).
$ git clone -o golden -b master git://github.com/burrmill/burrmill 
Cloning into 'burrmill'...
Receiving objects: 100%, done.
$ cd burrmill 
$ ls
... bin etc lib libexec maint tools

For future commands, we assume your current directory stays as we set it above: the same as ./, at the top level of the BurrMill repository worktree.

What we just did was:

  • initialized a new repository called burrmill under the current directory;
  • created a remote alias golden (with -o golden) for the repository git://github.com/burrmill/burrmill. It is a common practice to name the remote from which you only get updates “golden” or “upstream.”
  • fetched the remote repository using the read-only git:// protocol, limiting the objects to those on the master branch (-b master) to reduce clutter.

Now add the ./bin directory of the repository to the PATH. You’ll need it both right now and later, upon every login, so make the change both in your .profile file (or whatever it is for your shell), using the full path to the burrmill/bin directory (as shown by echo $PWD/bin), and for the current session:

$ PATH=$PWD/bin:$PATH
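
For the permanent part, a line like the following in your ~/.profile works. This is only a sketch: it assumes you cloned the repository directly into your home directory; substitute the path that echo $PWD/bin printed for you.

# Add BurrMill tools to PATH; adjust if burrmill was cloned elsewhere.
PATH="$HOME/burrmill/bin:$PATH"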

3. Initialize your new project

Run the interactive command

$ bm-update-project

and follow the prompts. The initial setup may take up to 30 minutes. The same command can be run later to fix the global configuration in a non-destructive way. If you’ve modified the project in such a way that you lost track of where you are, this is the command to run to show and fix inconsistencies. A repair or update run takes no more than a couple of minutes; it’s only the initial setup that takes long.

All BurrMill tools are named with the prefix bm-*.

The storage buckets created during initialization are billable per GB×month. They are empty and free of charge now, but for a fully-booted project you should expect administrative storage to cost about $0.75±0.25 per month (software, container images, OS images), provided you follow our recommended purge practice of keeping the 2 latest disk images of each kind.
This does not include (a) dataset and model storage and (b) snapshot storage for NFS disk backups, which are highly dependent on disk size (for reference, my storage costs, and my datasets and disk backups are far from small, are just over $6/month in total). For full details, look at the Standard (first) column of the cost table; generally, it varies from $0.02 to $0.026 per GB×month. In the location dropdown on that page, ignore the dual locations labeled “zone1 and zone2”: these are special, and used for extra-high-availability realtime GCP applications.

Also, the shared machine key is stored as a secret which is billed at $0.06 a month. I promised to disclose all billable resources, so here you go.

4. Calculate cluster resource quotas

Calculating and requesting quota is probably the most complex and hardest-to-get-right interaction with the GCP. The way the Web UI is organized does not help either. You are about to take on the hardest task you’ll perform on the GCP.

To compute the necessary quota, you need your good old pencil and a calculator. We are planning to add a tool to make this task easier, but right now it’s old school. Also, I’ll have to jump ahead of myself and explain what the computing cluster consists of, at the highest level, only as much as necessary to understand which resources you need to request. Remember that getting the quota costs you nothing.

A point to remember: preemptible and non-preemptible GPUs and CPUs are requested separately. You may use non-preemptible quota for preemptible use, but not the other way around. Also, non-preemptible CPUs of special types (e.g., C2) need separate quotas, but preemptible CPUs do not and all fall under the single Preemptible CPUs quota, so you can use preemptible C2, N2 or N1 CPUs all the same. This makes the process a bit easier for us here.

First, compute per-cluster requirements. A cluster consists of:

  1. A shared NFS read-write disk that you run experiments on. (You probably know that, but just in case: NFS stands for Network File System and allows all nodes to see the same files on the shared disk; we are using NFS version 4.1.) Usually it is a pd-standard, magnetic disk. New projects get a 4TB quota in every region; this is good enough for a start. Even if you plan huge computations or two clusters in the same zone, try to go with just one first; you may request additional magnetic storage later. This resource is plentifully available.
  2. Three non-preemptible control machines: sum up all the vCPU requirements. By default, you have 24 vCPU per region, which is likely enough:
    1. The NFS server is sized according to the shared R/W disk size. For a 4TB disk, put down 10 vCPUs; for a 2TB, 6.
    2. The Slurm controller. This is a small, e2-highcpu-2 machine, with 2 vCPU.
    3. The login node. If your experiment is planned with the cluster in mind, you offload tasks such as building HCLG graphs to compute nodes. If you just take a Kaldi example and run it as is, some heavy-lifting work is done right in the main script. This is a waste of computing power and money, so better modify such steps to use the script referred to as $cmd. Still, 4 vCPU is probably the necessary minimum.
  3. Preemptible GPU nodes. I usually run one GPU per node, as Kaldi training does not involve GPU-to-GPU communication, and atomizing the computation minimizes the impact of a node loss. A GPU node needs 4 vCPUs. Next, find the maximum nj you are going for in training. For example, your model is large and goes all the way up to 18 GPUs. You need a larger quota, so that a new node can be spawned even before the old one has been cannibalized for parts by the GCE. Also, you might want to train for another epoch, and go all the way up to 21. Add 3 to the maximum number as a cushion for restarting a lost node. Thus, you’ll need a quota of 24 GPUs of the type you want to experiment with on this model. The vCPUs for GPU nodes are the widely available N1. Multiply the number of GPUs you are requesting by 4 to get the number of vCPUs. Since you are not going to run different GPU types at the same time, multiply each separately and take the largest. For example, if you want 16 T4 GPUs and 24 P100 GPUs, take 24, the larger number, and get 24×4=96 preemptible vCPUs.
  4. Preemptible CPU-only machines. It is helpful that, unlike non-preemptible vCPUs, these all come under one quota regardless of CPU type, “Preemptible CPUs,” so you do not need to request, e.g., C2 CPUs for the preemptible nodes specifically. For my own large computations, I use 48 nodes, 10 vCPU each. You can put down 480 preemptible vCPUs right away; note that the C2 configuration is different, but the number is still quite sensible to have: I cannot give you a solid figure, but I think we are in the ballpark.
  5. Do not plan ahead for the egs mixer node yet. It is still in the works, and won’t arrive until beta 0.7. You’ll get the extra required CPUs for it later; it’s simple enough. We do not know yet whether an 80-vCPU or a 160-vCPU machine is best for the task.
  6. Add the vCPUs from steps 3 through 5 above. You’ll end up with 480+96+0=576; put 600 into the request (the arithmetic is double-checked in the snippet after this list). C’mon, don’t be scared by the numbers: you’re at Google. Peanuts.
  7. pd-ssd disk storage.
    1. We are using a 10GB SSD disk to boot every node, both the control and preemptible compute nodes. Get enough storage to boot all nodes at once. Add the 3 control nodes to the number of GPU and non-GPU nodes you computed. That is, if you have 24 GPU and 48 non-GPU nodes, you need 3+24+48=75 boot disks, or 750GB. Add a bit of extra, 50 to 100GB. We’re at 800GB.
    2. The common software disk. This is a disk that contains Slurm, Kaldi and everything else you want (SRILM, SCTK), shared by all machines in R/O mode. This is an SSD disk, and its size varies from 50GB in a small cluster to 150GB in a large one (we’ll get to the reason why size matters for a 4/5-empty disk in the next section). So, for a large cluster, you want 950GB of SSD storage. Round it up to 1TB for good measure.
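
If you like to double-check the arithmetic of this example with something sturdier than a pencil, the shell will do (the node counts are the example figures from the list above):

$ echo $(( 48*10 + 24*4 ))        # preemptible vCPUs: CPU-only nodes + GPU nodes
576
$ echo $(( (3 + 24 + 48) * 10 ))  # boot disks for all nodes at once, in GB of pd-ssd
750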

By the end of this exercise (remember: this is an example, although a realistic one), we’ve got, per cluster, taking the maximums: 14 non-preemptible vCPU (and you already have 24, you are good), up to 4TB of magnetic pd-standard storage (you’ve got exactly this from the start, ok); and the things that you do not have and must request:

  1. 600 preemptible vCPUs (default: 0)
  2. 1000GB of pd-ssd storage (default: 500GB)
  3. 16 preemptible T4 GPUs (default: 1)
  4. 24 preemptible P100 GPUs (default: 1).

If you are planning for two clusters right away, or if you had to split your test between regions because of varying GPU availability, repeat the calculation and write down the result. All quotas are allocated per-region…

…except two. A new-project catch: a newly established project also has an additional per-project quota on two resources, the total counts of CPUs and GPUs, respectively, preemptible or not, independent of their type. After a couple of months it will be lifted, but for now you have to request it. So if you are requesting the above quota in a specific region, you will also need to add the global quotas to the same request. The number defines how many devices can be active globally at the same time in your whole project. I’m putting ‘or’ below, depending on whether you are requesting the above quota in one or in two regions and planning to run both clusters at the same time. If you are not running active computation in both, you do not need to sum them up.

  1. Global GPU in all regions: 24 or 48 (default: 0)
  2. Global CPU in all regions: 600 or 1200 (default: 32)

So this is as much as you need to know about quotas. As you spend more time and money on GCP, many initially limited resources will disappear from the quota table, and other quotas will just grow without your requesting them.

You do not need to apply for a quota for any other resource; their use, even if limited, stays a few orders of magnitude below the level at which a quota increase would be required. There are build, storage and maintenance functions and many other moving parts that you do not need to care about.

5. Request the quota

This is probably the ugliest interface in the whole GCP, so have patience and bear with it. You’ll have to scroll through a table with items placed in seemingly random order and without any search capability. The browser’s search may help you find items in this long, poorly sorted table.

Let’s take care of the global quotas first. Go to https://console.cloud.google.com/iam-admin/quotas?service=compute.googleapis.com&location=GLOBAL. This is not an “official” shortcut, so make sure it preselected the correct filters for you. Check the picture below, and make sure that “Compute Engine API” and “Global” are selected. If not, select them by first clicking on “None” in the drop-downs to reset them to empty, then scrolling and selecting the required item.

Next, go to the bottom of the page and set the number of items displayed per page to 200, so that all items are available at once (there are currently 106) and you can use the browser’s search facility without jumping between pages.

Make sure your project name (the human-readable one, BurrMill by default unless you chose a different name in gcloud projects create above) is shown above these drop-downs on the blue header. It should be, but if not, select it.

Now click on the “Edit quotas” button. The right part of the screen will change to a box where you will accumulate all quotas for the request. Find and check two items in the leftmost column (in small type under the larger, repeating “Compute API” heading): “CPU (All Regions)” and “GPU (All Regions).” On the right, a form will appear asking for your name, phone and e-mail. E-mail and name will be auto-filled; add your phone number (nobody will call you, really) and click on Next there.

Next, click on the “Location” drop-down, and click on the region you are requesting resources in. Leave the “Global” item checked, so that both “Global” and your region are selected. Sort the table by its “Location” heading. Remember that in each region you are requesting preemptible CPUs and SSD persistent disk (always), and between 1 and 3 different preemptible GPU types (T4, P100, V100). Now be careful about what you are requesting: under each item, I list the similarly labeled but different resources.

  1. “Preemptible CPUs”.
    BUT NOT:
    • CPUs.
    • {N2, C2, N2D} CPUs.
    • Committed CPUs.
    • Committed {N2, C2, N2D} CPUs.
  2. “Persistent Disk SSD (GB)”.
    BUT NOT:
    • Local SSD (GB).
    • Committed local SSD disk reserved (GB).
    • Preemptible Local SSD (GB).
  3. For each GPU type which you plan to use, X ∈ {T4, P100, V100}, “Preemptible NVIDIA X GPUs”.
    BUT NOT:
    • NVIDIA X GPUs.
    • NVIDIA X Virtual Workstation GPUs.
    • Preemptible NVIDIA X Virtual Workstation GPUs.
    • Committed NVIDIA X GPUs.
  4. Optionally, if you want magnetic storage in excess of the default 4TB in this region: “Persistent Disk Standard (GB)”.
    BUT NOT:
    • Persistent Disk SSD (GB).

Now, go to the right half of the screen. Each checked quota will have a field where you should enter the value calculated earlier. Fill them in with your numbers. In the “Request reason,” put a one-sentence description of why you need the resources, for example, “Training speech recognition models in a cluster, computed quotas per https://100d.space/p/burrmill/486#5”. If you are going to request resources in another region, add “Sending a request for more capacity in another region separately” (to explain why your global numbers are larger than the regional resources you are requesting). Submit the request.

If you are requesting resources in another region, rinse and repeat. I recommend sending one request per region, because if any of the resources is in shortage, the whole request will be rejected. For the same reason, include the global resources in each of the requests: this will save you time in case the single request that contained them is rejected.

Each of your requests will be confirmed with an automatic e-mail. Double-check the table in this e-mail for errors. The e-mail means that the request is received, not that you got the quota; you should wait for another message for the quota allocation.

The requests that you are sending are for relatively small amounts on the GCP scale, and it’s unlikely they will be rejected, except in one case: the T4 GPUs are very popular, and shortages are common. If you don’t get them, but don’t mind using the P100s instead (and they are available in the same region), reply to the rejection e-mail asking to drop the T4s and fulfill the rest. These e-mails go to the same team that handles the quota requests sent through the table, so you’ll avoid the error-prone process of composing a new request. Then, if the P100s were already in the same request, do nothing and wait; otherwise submit a separate request for the P100s only, for the same region. The 24–48 hour turnaround cited by Google rarely gets up to 48; usually you get a reply much faster than in 24, oftentimes the same work day.

If you fail to get a certain quota, look at the resource table again, and find other suitable zones. Reply to the e-mail (they go to real people), and ask if they can fulfill the same request in one of the regions containing these zones, in order of your preference. They’ll likely just fulfill the request without any further communication.

6. Tune BurrMill and build software

Our goal now is to build all the required software, and to give you only a very basic understanding of the build process. We’ll try to make everything as simple as it can be, but not simpler. The CNS disk preparation is the most complex and extensible process in administering BurrMill, and more details will follow as you progress. Please bear with us, we’re almost there!

The current BurrMill beta 0.5 comes with the bare minimum of software to run most recipes: Kaldi, obviously, with all of its hefty dependencies; Slurm to control the cluster; and a recipe for building the SRILM toolkit (you need to license the source separately). Our goal here and now is to get your rig churning. We’ll go through extending the build by adding more software later. (This is a roundabout way to say that I did not write any documentation on it yet, and “later” is a very specific word that precisely means “at some very non-deterministic time in the future, which is very unlikely to be tomorrow.”) If you need specific software added now, and it blocks you from beta-testing, please open a ticket. The source should be either openly available or otherwise accessible to us. If it’s not, let’s hop on a Slack channel. We want the system beta-tested as well and as widely as possible, so that’s a favor to us, not a nuisance.

Set up your Git repository

BurrMill is the type of software known as “infrastructure as code,” and code requires good source control. In fact, some steps will refuse to work, unless overridden with a --force switch, if they detect uncommitted control files. What you have now is a distribution, which you may update to a new version by running git pull. Of all directories, one, ./etc/, is dedicated to your files, which define your clusters, set software versions (especially important for Kaldi itself, which is a very fast-paced project), specify additional software you may put onto the common node software (CNS) disk, or additional packages you want installed into the base OS image.

When you run bm-update-project, files from ./lib/skel are copied to your ./etc/ directory (unless already present; we never overwrite your files).

Ideally, you should never modify files outside of that directory. (If you are contributing to the project, set up a repository separate from your “production” repository that controls the infrastructure where you run the experiments.) For your “production” repository, maintenance is confined to the ./etc/ directory, unless you found a bug or had to tweak a script for a scenario we do not otherwise support (in this case please open an issue, whether you want to contribute code or just explain what issues/workarounds you had to deal with). Since you’ve already run bm-update-project, certain files have been placed in your ./etc/, and Git thinks that they must be added before your repository is squeaky clean (type git status while your current directory is inside it).

Now, a Git repository is just a set of files on your disk, and it can disappear if the disk fails, or you delete it by mistake, or you do not log in to your Cloud Shell for 120 days (your disk is erased if you abandon it). You should safeguard it better, and for Git this usually means pushing your changes to a remote repository service that takes care of safekeeping your precious code for you. GitHub is one option, if you want your setup publicly visible. GCP also has a Git repository service, which you may keep private to yourself. You probably know how to work with GitHub, GitLab or Bitbucket, so, to get you introduced to the GCP Git repo service, we’ll do it the GCP way.

It’s not easy to make a GCP repository public, and if you do, you’ll pay for other users’ access. If you want to open your code, use one of the public services. You can republish it later if you want. But there is little added value in your setup tweaks by design: we try to make the common part of BurrMill work for everyone, and your ./etc/ setup is unlikely to work for someone else, so a GCP-hosted repo may really be the best choice for you.

GCP repositories are free up to 5 user×months; above this free limit, a $1 charge per user×month applies. If you are working alone, you can have 5 free repos in all your projects combined. If you have Alice added as a user to your single project, and both you and she used (pulled from or pushed into) each of your 3 repos during a billing month, that’s 2×3=6 user×months, and you are billed 6−5=$1 for the usage over the free tier that month. If you have 7 projects, each with 1 repo, and you used every one of them within a month, you’ll be charged 7−5=$2 that month. It’s the owning users, not their owned projects, that count.

The advantage of GCP Git repos is that you do not need to set up SSH keys to access them: the account authorized to use gcloud can be used for a transparent HTTPS access to them. First, see how the remotes are set up now:

$ git remote -v
 golden  git://github.com/burrmill/burrmill (fetch)
 golden  git://github.com/burrmill/burrmill (push)

But git:// is a read-only protocol, and you won’t be able to push anything at all, obviously. (And if you could, I’d become very angry at you! The repo is called golden for a reason!) What you want is your own remote repository, writable by you, where you push your changes and pull them back from. We need to create a repository in your project (let’s call it burrmill), add it as another remote, and tell Git how to authenticate with it.

Still remember that the gcloud command has a describe verb for almost everything?

# Create a new repository.
$ gcloud source repos create burrmill
Created [burrmill].
WARNING: You may be billed for this repository...<snip>

# Figure out the URL you can use with Git.
$ gcloud source repos describe burrmill
name: projects/YOUR_PROJECT/repos/burrmill
url: https://source.developers.google.com/p/YOUR_PROJECT/r/burrmill

# Add an alias for the new repository. Call it 'origin' and assign
# it the URL printed right above.
$ git remote add origin FULL_URL_PRINTED_ABOVE

# Register gcloud as a helper with Git. '--global' adds the helper
# to your ~/.gitconfig file, so you'll do it only once. The '!' at
# the start tells Git that what follows is an external command. All
# GCP Git repos are hosted at 'https://source.developers.google.com'. 
$ git config --global credential.'https://source.developers.google.com'.helper '!gcloud auth git-helper --ignore-unknown'

# Push the repo from your disk to GCP.
$ git push -u origin HEAD
 . . . .
Branch 'master' set up to track remote branch 'master' from 'origin'.

HEAD, as you probably remember, is a special symbolic alias for the current branch in Git. The last message says that on this computer (this bit of configuration is stored in the ./.git directory of your local clone) git push without arguments will by default push your commits on the master branch to the repo you just named with the alias origin; this is the effect of the -u switch. The same applies to git pull: to pull and merge BurrMill updates into your master you’ll have to tell Git explicitly to do so, with git pull golden HEAD.

Note that you did not stage or commit your ./etc/ files yet, so that your GCP repository is currently an exact copy of the one you pulled from GitHub.

You need to run bm-update-project after you pull any new revision of BurrMill files. It takes only a minute or two to run, unlike the lengthy initial setup.

Hints for efficient Gitting elsewhere

  • If you work from more than one machine, after your GCP repo is created, you should clone it using git clone defaults, to label your repo URL as origin:
# Figure out the URL you can use with Git.
$ gcloud source repos describe burrmill
name: projects/YOUR_PROJECT/repos/burrmill
url: https://source.developers.google.com/p/YOUR_PROJECT/r/burrmill

# Clone your repository locally. Since the URL ends in 'burrmill', Git will
# create the local copy in a subdirectory 'burrmill' under current directory.
$ git clone FULL_URL_PRINTED_ABOVE
  • Or, with a bit of gcloud-fu, a one-liner will do it. You’ll probably guess how it works. (The full --format manual is printed with gcloud topic format, but you’ll drown in it. Just guess.)
$ git clone $(gcloud source repos describe burrmill --format='get(url)')
  • Both commands require that the Git credential helper be set up in the “global” Git configuration, i.e. ~/.gitconfig. It’s a good idea to copy this file to every machine you work from, because it sets required Git parameters (your name and e-mail for commits), and many useful preferences.
  • You do not need to set up the alias golden if it doesn’t exist. Git understands a repository URL directly on the command line. So, to upgrade BurrMill later (which is not something you’ll do often) you can simply pull the code using its “golden” URL on GitHub:
$ git pull git://github.com/burrmill/burrmill HEAD
From git://github.com/burrmill/burrmill
branch            HEAD       -> FETCH_HEAD
Already up to date.
Current branch master is up to date.

Build Common Node Software (CNS) Disk

Every BurrMill node shares a common read-only disk. This is a cloud-only feature, not possible in a physical cluster. There is only one provisioned copy in a cluster at a time, and all machines where the disk is mounted share the read bandwidth of this disk. Now, one fact that will be helpful to remember is that in GCE disk bandwidth is proportional to disk size: there is a specific number in MB/s per GB of disk capacity (0.48 for an SSD, to be specific). Thus, a 100GB SSD has the read and write bandwidth of 100×0.48=48 MB/s. All nodes share this bandwidth, so the more nodes are churning your data, the larger a disk you need.

The software build process uses the GCP Cloud Build (GCB) service. The reason we do that is that all software is built, and later runs, under exactly the same OS. After the build, we package the resulting binaries, called artifacts, in two possible ways:

  • The libraries that are used at both build time and run time (CUDA, MKL) are stored as Docker images in the project’s Docker registry (all highlighted terms in this bullet are Docker, not GCP terms). Docker supports image tags, which indicate a version of the image.
  • The software that is only needed at run-time (Kaldi, Slurm, etc.) is packaged into tarballs and stored in a GS bucket, namely, the Software bucket of your project. We attach version metadata to this tarball.

You do not need any knowledge of either Docker or GS. We’ll go through all the details later, in a chapter on extending the CNS build process.

After all artifacts are built, or later, when a piece of software is rebuilt, a snapshot image of a version of the CNS disk is assembled from the artifacts. Both build and assembly are performed with the bm-node-software tool, and controlled by two files named Millfile: the main one in ./lib/build/ is loaded first, and the override file for your customizations in ./etc/build/ second. Both files have the same format and are well-commented, although your override file initially comes nearly empty. In this initial walkthrough, we’ll tread lightly.

Pin Kaldi version

(Screenshot: the “Copy commit ID” icon in GitHub’s commit history.)

This step is required. Kaldi is a very dynamic project, and you want your experiments to be reproducible. Look at the version history, and select a commit not far from the latest. It sometimes happens that a couple of very recent commits create unexpected behavior; look at a few recent commits and use your judgment: if the recent changes are significant, the best point to select may be one or two weeks back, the timeframe in which recently introduced bugs are usually caught. Click on the Copy commit ID icon, edit ./etc/build/Millfile and paste it into the ver kaldi line (it’s near the end). The copied ID is a full 40-character Git hash. Remove all but the first 9 hex digits. It’s critical that you count them right. What you should get in the end, using the example commit ID from the picture, is a string that is 2 digits longer than what the GitHub interface shows:

ver kaldi 53c4bd19f
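
If you prefer the command line to clicking around GitHub, git itself can produce the 9-digit abbreviation for you. A small sketch, assuming you have a local Kaldi clone at ~/kaldi; FULL_COMMIT_ID is the 40-character hash you copied:

$ git -C ~/kaldi rev-parse --short=9 FULL_COMMIT_ID   # prints the 9-digit prefix, e.g. 53c4bd19f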

Optional: add SRILM if you use it

BurrMill includes a build script for SRILM as an optional example. The build is disabled by default. If you use SRILM within a Kaldi recipe, check that the version in the Millfile is correct, upload the source to the Software bucket, and enable the build.

Get the source tarball from the SRILM licensing and download page if you qualify for a research license, or get a commercial license otherwise. Make sure the file includes the version suffix, for example, srilm-1.7.3.tar.gz.

Upload the tarball to GS, into the location where the build expects to find it. gsutil autocompletion can be very helpful for locating the bucket; you can also type gsutil ls to list all buckets in your project. The one you are looking for, the Software bucket created with the project, is the one starting with the gs://software- prefix. Upload the tarball into its software/ directory:

$ gsutil cp srilm-1.7.3.tar.gz gs://software-YOUR_PROJECT-RANDOM_SUFFIX/software/
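
To make sure the tarball landed where the build will look for it, you can list that directory (read-only; use the actual bucket name that gsutil ls showed for your project):

$ gsutil ls gs://software-YOUR_PROJECT-RANDOM_SUFFIX/software/
gs://software-YOUR_PROJECT-RANDOM_SUFFIX/software/srilm-1.7.3.tar.gz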

Then, in your Millfile, find the line starting with tar srilm and make sure the SRILM version is correct. Also, comment out the line skip srilm using the comment prefix #; that line disables the build of the artifact, and is not commented out by default (you can also delete it if your setup uses SRILM permanently). The lines should look like this:

tar srilm  1.7.3  _SRILM_VER : cxx
... a few lines down ...
# skip srilm

The first directive requires a bit of explanation, so that you understand what you are doing, at least at an overview level. The Millfile is modeled after a Makefile, as you have already guessed. The tar directive means that the build artifact is a tar.gz file that will be uploaded to the Software bucket (another possible artifact type is image). srilm is the directory name where instructions for building the artifact are found; ./etc/build/srilm is looked at first, and the files are in fact there, but were they not, the ./lib/build/srilm location would have been used. 1.7.3 is the version, and _SRILM_VER is the variable name that the version is assigned to and passed to the build instructions. cxx is a dependency, just like in a Makefile; it is a builder artifact which is not installed onto the CNS disk, but must be built first anyway. It contains all packages required to compile C/C++ code. We’ll get into advanced details of the build and assembly process later.
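
For a visual summary, here is the same directive annotated field by field (the annotation is ours, written entirely as Millfile comments using the # prefix, so it is inert if pasted into your file):

#  tar srilm  1.7.3  _SRILM_VER : cxx
#  |   |      |      |            |
#  |   |      |      |            +-- dependency: the cxx builder artifact, built first
#  |   |      |      +-- variable the version is assigned to and passed to the build
#  |   |      +-- version string
#  |   +-- build directory (./etc/build/srilm here; else ./lib/build/srilm would be used)
#  +-- artifact type: a tar.gz uploaded to the Software bucket (the other type is image)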

Build The CNS Disk

The CNS Disk build and assembly is done by the bm-node-software tool. All tool commands except bm-update-project support --help (or short -h) switch. The build subcommand evaluates dependencies, build only out-of-date artifacts (just like make would), and then runs the assembly stage, where the actual disk is created, stored as a snapshot, and then deleted. The snapshot is a disk backup image that can be instantiated as the disk later. The snapshots, and the CNS disk instances, are named burrmill-cns-vNNN-YYMMDD, with the version NNN starting at 001 and progressing up by 1 with each new build, and the YYMMDD part is the snapshot assembly date, just for convenience.
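
If you are curious, the snapshots the tool creates are ordinary GCE snapshots, and you can list them at any time with a read-only command (the filter simply matches the naming scheme just described):

$ gcloud compute snapshots list --filter='name~burrmill-cns'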

Run the command

$ bm-node-software build

The command runs as many builds as their dependencies allow in parallel. Still, the complete process of the initial build and disk assembly takes 35 to 40 minutes.

Even a full rebuild of all software except Kaldi consumes less than half of your daily free allotment of 120 build minutes (only 1-vCPU build hosts are covered by the free tier). Kaldi is a much larger piece of software. Other software, such as SRILM or Slurm, takes about 15 minutes each to build on a default 1-vCPU machine, but we request a 32-vCPU build machine for Kaldi, and it also takes about 15 minutes. On a smaller machine, building Kaldi would be simply impossible.

After the build completes successfully, the assembly phase begins. At the start, you will be shown a summary table of the artifacts for the new disk; we call this summary the manifest:

+——————————————————————————————————————————————————————————————————————————————————————————————————————+
| NAME  | VERSION   | TYPE  | ARTIFACT LOCATION                                                        |
|——————————————————————————————————————————————————————————————————————————————————————————————————————|
| cuda  | 10.1.2    | image | us.gcr.io/YOUR_PROJECT/cuda:10.1.2                                       |
| kaldi | 53c4bd19f | gs    | gs://software-YOUR_PROJECT-nnnnn/tarballs/kaldi.tar.gz#1585169650524797  |
| mkl   | 2019.5    | image | us.gcr.io/YOUR_PROJECT/mkl:2019.5                                        |
| slurm | 19.05.4-1 | gs    | gs://software-YOUR_PROJECT-nnnnn/tarballs/slurm.tar.gz#1580168467411160  |
| srilm | 1.7.3     | gs    | gs://software-YOUR_PROJECT-nnnnn/tarballs/srilm.tar.gz#1579874839920642  |
+——————————————————————————————————————————————————————————————————————————————————————————————————————+

To create the CNS disk snapshot we invoke Google’s own imaging software, Daisy, although our use of it is a little bit of subversion: Daisy does not even have a capability to snapshot a disk; its main purpose is building bootable OS images, and the snapshot is actually saved from a temporary VM by our build script. Daisy works by allocating GCE resources, starting and stopping temporary VMs according to its controlling workflow, and saving the final result. The script will ask you to download the latest Daisy version if it’s not already in the ./bin directory.

At the end of assembly you are shown the summary of all snapshots and disks, and their manifests.

The recommendation to prune disk snapshots is only a recommendation. Their storage is cheap. It depends on your pattern of use: do you want to return to the exact environment of old experiments, or are you fine re-running them with an updated Kaldi/MKL/CUDA stack? The bm-node-software rmsnap subcommand can remove a specific snapshot that you never used, if you prefer a strategy other than pruning down to the second-latest one.

Always record manifests into your experiment logbook to ensure you can always return to the entirely identical software configuration and trace unexpected changes, should you suddenly see them. We keep assembly logs in GS for 120 days only.

Some resources used in the disk creation are billable. A full Kaldi build on the 32-vCPU build machine costs in the range of $0.90 to $1.00. The bill for producing the image snapshot is insignificant (<$0.01). A single image snapshot of the BurrMill stock size, without additional software, takes about 1.5GB of GS storage, which rakes you up about $0.04/month.

A provisioned SSD disk, on the other hand, is more expensive, $0.17 to $0.22 per GB×month (as a rule of thumb, regions with a larger selection of GPUs and CPUs, those most suitable for our work, gravitate toward lower costs), so a provisioned, ready-to-go 100GB SSD disk for a cluster in the L–XL range costs $17–22/month to keep, whether you use it or not. We never keep more than one provisioned shared CNS disk per cluster, and by default share it between clusters in the same zone. (While cost-saving, this may not be the best strategy if you are an advanced user and run a few (3+) clusters in the same zone at the same time. If you do, provision a larger disk to maintain the expected performance: you certainly do not want to overpay for idling GPUs because the large Kaldi binaries with their CUDA and MKL libraries take an extra few seconds to load!) Remember that disks are zonal resources.

Prepare and build OS image

Unlike the CNS disk build and assembly, OS image build is a pretty simple and straightforward process, so let’s dive deeper into it.

If you feel overloaded by this point, and just want to complete this walkthrough ASAP, feel free to return to this section later. You can leave most settings as they are, except one: skip down to Image Phase 2 Customization and make sure to set your timezone.

Role of the OS image

All cluster machines start their life from a single OS image. A custom image is built from the base Debian 10 image provided by Google (which is, in turn, a stock Debian 10 image customized by Google by preinstalling GCE-specific services). On top of it, we install all necessary services (such as Slurm and NFS), so that a node can be booted quickly. The BurrMill image is prepared as a “stem cell” that can become any node type when a new VM is booted from it: a filer node for the NFS server; a control node that runs the Slurm controller, which creates and deletes compute nodes as necessary and assigns units of work to them; a login node, to which you connect and start the experiments; or a preemptible compute node. The image also has NVIDIA drivers installed, so that a compute node can run both CPU and GPU computations.

The separation of software into the CNS disk, which is shared by all nodes, and a boot disk, which is created from the image individually for each node, ensures that all nodes share exactly the same software, and are also disposable: with the possible exception of the login node, you do not ever update packages on nodes. Instead, you simply rebuild the filer and control nodes after building a new version of the base image. Also, software on the CNS disk normally changes more often and at random times (Kaldi upgrades, mainly), while image updates are mostly tied to an OS servicing schedule. All useful information is stored on the shared NFS disk, and possibly on the login node if you work on it. It’s not even possible to update the filer and control nodes, because they have no Internet connectivity (nor do the ephemeral compute nodes); this allows for a pretty lax policy on upgrading them (you do not need to install security upgrades that patch an immediate threat, for example). Normally you do not even need to log in interactively to these nodes unless something goes wrong; they are fully automated and self-sustained.

Whether to rebuild or update the login node is up to you. This is the only node that has direct access to the Internet (but is not accessible from the Internet). I normally use it as a work machine, upgrade it, and rebuild only when necessary (if the image changes significantly), saving and then restoring my home directory in a GS bucket. This is another benefit of separating node’s OS and software. I’ll describe the technique in a later post.

Image build process

All files controlling the image build process are collected from the directories ./lib/imaging and ./etc/imaging/ and uploaded to the Scratch GS bucket.

The image is built in two phases. At phase 1, a disk freshly created from the base Debian system is attached as a non-boot disk of a temporary VM, and the script ./lib/imaging/scripts/1-bootstrap-layout.sh is run with root privileges inside this VM. All packaged files are downloaded to the new disk into the /.bootstrap directory. Next, some of the downloaded packages, called layouts, are simply extracted to the root of the future image, one by one. Finally, an optional script or other program may be executed (we do not use this currently, but you may supply one). At this point, the temporary VM is shut down and discarded.

At phase 2, a new VM is booted with the disk prepared at phase 1 as its boot disk (this VM is booted with only one disk), and another script, taken from ./lib/imaging/scripts/2-prep_deb10compute.sh, is run. The script installs packages required to run both cluster controls and Kaldi experiments, both shared libraries and tools, and NVIDIA drivers. Next, certain users and groups required to run the cluster are created, the machine configuration is modified to remove unneeded services, and mounts for the CNS disk and the shared NFS disk are added. Lastly, all unnecessary logs are erased, the machine identity is reset, bootstrap files are removed, and the phase 2 VM is shut down.

Lastly, the GCE image is created from the disk, and the disk itself is discarded.

Image Phase 1 Customization

Every layout is originally a subdirectory of ./lib/imaging/layouts or ./etc/imaging/layouts, and each such subdirectory is transferred to the root of the future image. First, layouts from the subdirectories of ./lib/imaging/layouts are copied in alphabetical order. (There are two subdirectories, common and compute. This is a historical artifact from the time before IAP Tunnels arrived, when we had to build another machine type, a hardened SSH gateway. We still keep the separation, so it’s easy enough to add another type of image.) Next, layouts from the subdirectories of ./etc/imaging/layouts are placed (if there are any), also in the alphabetical order of directory names. Your own layout can be used to put files into the /etc/skel directory, which is copied to every user’s home directory on first login, and thus preload your .bashrc, .vimrc, .gitconfig or any other files that you want customized. (But do not even try that if “you” is plural: $n>1$ un*x users have at least $n+1$ disjoint sets of preferences, and $\mathcal{O}(n^2)$ “only correct” ways of applying them.) Another practical use for your own layout is adding non-packaged software, usually installed under /usr/local.

As mentioned already, an optional program may be run at the end of phase 1. The script is taken from ./etc/imaging/addons/1_user_post_layout, runs with the root identity, and can be written in any language preloaded into the OS (which is Ubuntu 18.04, not Debian 10, for certain technical reasons beyond this discussion): bash v4.4, perl v5.26, python3 v3.6, or GNU awk v4.1.

Image Phase 2 Customization

Edit the file ./etc/imaging/addons/user_vars.inc.sh, which currently supplies two customization variables to the phase 2 script:

  • USER_TIMEZONE: set this to your timezone. If you leave it unset, all nodes will use the UTC timezone. Unless you are based in Iceland, this is probably not what you want.
  • USER_APT_PACKAGES: add package names that you want installed into the base image in addition to the default packages. If you build and add software to the CNS disk, make sure to add its runtime dependency libraries. This is a bash array variable; add package names as single words separated by spaces or newlines between the parentheses, e.g., USER_APT_PACKAGES=(libfile-path-tiny-perl libset-tiny-perl python3-levenshtein). A complete example of the file is sketched below.
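
As an illustration, a filled-in ./etc/imaging/addons/user_vars.inc.sh might look like this (the timezone and package names are examples only; substitute your own):

# Timezone for all nodes; leave unset to keep UTC.
USER_TIMEZONE=Europe/Berlin

# Extra APT packages baked into the base image, one word each.
USER_APT_PACKAGES=(
  libfile-path-tiny-perl
  libset-tiny-perl
  python3-levenshtein
)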

Currently, as of the initial v0.5-beta, there is no support for tools that install Python packages or Perl modules, like pip or cpan. Contact us if this is a requirement for you. It’s an easy feature to add; I just did not feel competent enough in the ways people commonly use these tools, and, at the very least, wanted to avoid doing it horribly wrong.

Building the Image

At this point, you are ready to go. If not working from the Cloud Shell, start tmux or screen (Daisy takes care not to leave behind any temporary resources, such as VMs or disks, even if it fails or is interrupted with Ctrl+C; although it also handles the SIGHUP signal on a connection loss, systemd may limit the time allowed for it to finish the cleanup), and run the command

$ bm-os-image build

The build takes under 10 minutes. You will mostly see progress messages from Daisy, followed by a summary of all images in the burrmill-compute family. The family is just a common name for a group of images which are considered successive updates of the same image; there is nothing technically special about them. All images built by BurrMill tools belong to that family.

The naming scheme is the same as that of the CNS disk snapshots: burrmill-compute-vNNN-YYMMDD.
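
As with the snapshots, you can list what you have at any time with a read-only command; the filter matches the family mentioned above:

$ gcloud compute images list --no-standard-images --filter='family=burrmill-compute'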

Image storage is billed, but the rate is not quite certain, as the pricing table confusingly does not show actual storage locations (us); it appears to be under $0.10 per GB×month. Only the archive size of the image counts (not the 10GB disk size shown in the Cloud Console); the BurrMill image built with out-of-the-box packages shows an archive size of about 0.6GB. In any case, you are not in for a sticker shock: looking at the billing query right now, I raked up $1.48 in over a year for my custom images in the us multiregion, which varied in quantity from 1 to 4 at different times.

The image build process is also billable, but insignificant (≈$0.01).

WHEW! You are done with the major part of the supporting infrastructure. You can deploy a cluster and crunch your models.

But wait, don’t rush it. You probably don’t want your precious starting configuration to disappear.

7. Commit and Push Your Changes

Since your Git repo is already set up, it takes just a few Git commands to commit the files in ./etc/ (they are not even under source control yet) and push your first commit.

# Stage added files in the etc/ directory.
$ git add etc

# Commit staged changes into your local repository.
$ git commit -m 'Configure GCP project to run BurrMill'
[master  2590e29] Configure GCP project to run BurrMill
11 files changed, 398 insertions(+)
create mode 100644 etc/build/Millfile
 . . .
create mode 100644 etc/imaging/addons/user_vars.inc.sh
 
# Push the new commit to your GCP remote
$ git push
 . . .
To https://source.developers.google.com/p/YOUR_PROJECT/r/burrmill
   9ca784e..2590e29  master -> master

Now you are really done with the project configuration part!


In the next part of the course, we’ll confer superpowers on your old friend SSH.
