GCP Economic Geography 101

In this section we’ll go through the high-level structure of the Google Cloud Platform (GCP), understand the relevant parts of its pricing model, and select the location for your BurrMill rig. This section is one of a few densely loaded with information which you need to absorb. The end goal is to select where in the world you are going to compute. Next, you can request the quota and complete your setup while waiting for a response.

Space is big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist’s, but that’s just peanuts to space.
  — Douglas Adams, in The Hitchhiker’s Guide to the Galaxy

GCP can be overwhelming. No, that’s not strong enough a statement: GCP is overwhelming. It has good, detailed documentation, but it would amount to a few thousand pages if it were printed. There are tutorials and hands-on walkthroughs, many hundreds of them. And your head is full of preconceptions about physical hardware, which are amazingly unhelpful when applied to cloud machines.

Dissecting GCP

There are so many angles here in the 100D space from which you can look at things. Let’s start with a simplistic but easy-to-understand analogy.

GCP as one of Google Services

The 9-hole button menu

You probably have a Google Account (if not, create one right away, we’ll need it). A Google Account identifies you to multiple loosely coupled services: you can attach a file from Drive to an e-mail you write in GMail, Drive is where all Google Docs are really stored, and so on and on. When you click on that 9-hole button, you see at least some of them.

Some of these services contain their own, lesser services that are also coupled to one another, and may use other Google services. For example, Google Docs has forms and spreadsheets, and each form entry is recorded into the spreadsheet; a document can have a picture embedded from Google Photos, which is “one level up” from the document (Docs and Photos are peers).

Both of these examples have analogues in GCP. We’ll simplify the picture without dumbing it down. First, GCP uses the same Google Account to identify you. You go to drive.google.com, or mail.google.com, or cloud.google.com, and are recognized there as the same you. Second, inside this major service are others, just like documents and spreadsheets in Docs.

In this perspective, the major differences between GCP and user services are:

  • The relative scale of services is much wider in GCP. If you score the document as the simplest service of the Docs, and the spreadsheet as the most complex, that’s just peanuts to GCP. Some services store and retrieve a few KB of key-value configuration data, while others, such as Compute Engine, have dozens of distinct entities and contain their own container hierarchies.
  • The number of services that GCP offers is at the level where a quantitative difference becomes qualitative. A quick check of my BurrMill project has just shown that I use 21 out of 280 available services.
  • You can have multiple projects inside the GCP. Unlike owning at most one GMail, or exactly one Drive, everything that is alive and kicking in GCP is contained inside some project.

Note that the above is not quite technically correct; it’s rather an operational analogy. Don’t hold onto it, discard after use, wash your hands carefully.

GCP at the top level

This is the simplest thing of all: there are only two entities at the top level of GCP:

  1. Projects, containers for all GCP machinery.
  2. Billing accounts so that you can pay for everything that this machinery crunches.

A project may have zero or one billing account (but is not of much use without one). A billing account can be linked to zero or multiple GCP projects.

Currently, when you open a billing account by signing up for free trial, Google posts a credit of US $300 on it that you must spend in 12 months.

When you open an account for a free trial, Google promises not to bill you; your services will simply stop working if you spend the trial allotment. The trial account is marked as such. When you upgrade a billing account, neither the free money left nor the time remaining to spend it disappears. I highly recommend upgrading right away, because the Free Trial is very limited; one restriction is that you cannot use GPUs, and GPUs are likely the main reason you are reading this.

I’ll mention explicitly when and for what you will be billed. Everything else is free, either by design or at our scale.

Otherwise, a billing account is not a very interesting entity. From now on, let’s focus on projects (and remember, you can have many projects).
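
If you prefer the command line, you can inspect and link billing accounts with gcloud, too. A minimal sketch (the project ID and the billing account ID below are placeholders you must substitute; at the time of writing these commands live under the beta component):

$ gcloud beta billing accounts list
$ gcloud beta billing projects link myproject-322 --billing-account=0X0X0X-0X0X0X-0X0X0X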

Quotas

How many projects is many? The concept of the quota is important to understand. There are many caps on the number and the rate of use of resources in GCP. Some parameters are uncapped, while some caps cannot be increased at all. There are limits that look bizarre, such as no more than 15,000 VMs running per network, and there are others we may run into, such as the limit on creating no more than 10 VMs per second (fortunately, this limit is calculated within a sliding window of 100 seconds, because launching some 50 nodes at the same time when a decoding job starts after GPU training is pretty routine).

The remaining limits are artificially capped, and are increased by support based on user requests. These requests are usually approved (I was denied only once, when requesting 24 T4 GPUs in the most crowded location at a time when T4s were in top demand). Starting quotas on new accounts are fairly low, and approval takes 1-2 days, so it’s important to request enough quota in advance, while we are preparing other things. We’ll get to the exact resources required a bit later. Just keep in mind you’ll have to calculate and request the resource quotas.
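
When the time comes, you can check the quotas currently in effect without leaving the terminal. A sketch, with us-west1 standing in for whatever region you end up choosing (GPU and CPU quotas are per-region; a few others, such as networks or snapshots, are global and show up under project-info):

$ gcloud compute regions describe us-west1 --format="yaml(quotas)"
$ gcloud compute project-info describe --format="yaml(quotas)"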

GCP Services

A GCP project is a passive container that holds together a few pieces of information that just passively sit there: a codename, a numeric identifier, a human-readable name, security accounts. Meanwhile, everything that happens is controlled by services.

Pay attention to a couple of misnomers.

First, the words service and service API, or even just API, mean the same thing in the documentation; I’ll stick to the word service as the least ambiguous term.

Second, a GCP project may contain at most one Google Compute Engine (GCE) project, and at most one AppEngine project (the latter we don’t use directly). The meaning of the term project thus depends on context, not only in the docs but even in CLI commands. For example, the command gcloud projects describe foo prints information about the GCP project named “foo”, but gcloud --project=foo compute project-info describe describes the GCE project within the GCP project named “foo”. Please absorb that and keep these different “projects” separate in your mind.

When you enable a service in a GCP project, this service’s API becomes functional. Most services accept nearly equivalent requests from the command line tool gcloud and from the Cloud Console, a Web UI at console.cloud.google.com (no, I didn’t forget to make it a hyperlink; please resist the urge to check it out yet). gcloud is useful in shell scripting; the Cloud Console is sometimes useful for one-off tasks. The third way to access a service’s functionality is direct API calls, for which Google provides client libraries for multiple languages. In fact, both gcloud and the Cloud Console call the same service APIs to perform your commands.
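
For example, this is how you could check which services are already enabled in the current project, and turn on the Compute Engine service (the service name compute.googleapis.com is real; whether you need to do this by hand depends on how the project was created, so treat this as an illustration of the service-enabling machinery):

$ gcloud services list --enabled
$ gcloud services enable compute.googleapis.com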

Locality in GCP

There are little services, like Git repository hosting or configuration storage, and huge services, like GCE, with all its networks, subnetworks, virtual machines, disks, bootable disk images and disk snapshots. We must look a little bit closer at the two most important services: GCE, and Cloud Storage, a.k.a. GS (yes, it’s a G, not a C), or just Storage. Both of them have a concept of location, and both, summed up, contribute the largest cost to your experiments. Since you want to deploy a computing cluster, interact with its console, at least using a text terminal, and get back the models, it’s better to understand where resources are located for each of these services, and what the implications are. We’ll also take a closer look at another billable resource, namely network traffic. Normally, networking costs are negligible for Kaldi experimentation (you’ll see why later), but you need to understand when they are not.

Cloud Storage

Storage is a fairly simple service. It is designed to store and provide access to named files of bulk data at a small cost. The service holds data in buckets, and the files, called objects in Storage lingo, look very much like files in a hierarchical filesystem.

Storage does not provide much of the rich semantics usually expected of a filesystem, such as locks or overwriting a range of bytes; you would not be wrong if you think of it as a remote filesystem you would access using FTP or similar protocols and tools. A separate tool gsutil from the SDK is used to transfer or delete files and list contents of buckets. All buckets share a global namespace in GCP, and each bucket must have a unique name, like software-bmtest-e51 in the example below.

$ gsutil ls -lR gs://software-bmtest-e51/
gs://software-bmtest-e51/debian/:
    868224  2020-01-29T07:14:02Z  gs://software-bmtest-e51/debian/git-man_2.24.0-1ubuntu1_all.deb
   4504200  2020-01-29T07:14:04Z  gs://software-bmtest-e51/debian/git_2.24.0-1ubuntu1_amd64.deb

gs://software-bmtest-e51/hostkey/:
       165  2020-01-16T09:31:50Z  gs://software-bmtest-e51/hostkey/ssh_host_ed25519_key.pub

gs://software-bmtest-e51/sources/:
  27852464  2020-01-29T07:09:03Z  gs://software-bmtest-e51/sources/srilm-1.7.1.tar.gz
  65701090  2020-01-01T06:35:50Z  gs://software-bmtest-e51/sources/srilm-1.7.3.tar.gz

gs://software-bmtest-e51/tarballs/:
 560279888  2019-12-26T21:08:20Z  gs://software-bmtest-e51/tarballs/cuda-10.1.tar.gz
4811091105  2019-12-26T21:11:27Z  gs://software-bmtest-e51/tarballs/kaldi-190702-gab4eca0c2-190816.tar.gz
 692627688  2020-01-24T14:38:29Z  gs://software-bmtest-e51/tarballs/kaldi.tar.gz
 224233048  2019-12-26T21:12:56Z  gs://software-bmtest-e51/tarballs/mkl-2019.2-057.tar.gz
  . . . .

Convention: we’ll set the commands in examples and walkthroughs in bold type, except for the part of a command that you should substitute with another string, which varies from project to project or depends on your choices. These are set in italic type, as above. In command output, important lines will be set in bold. Commands are always prefixed with $ .

Naturally, Cloud Storage is also an underlying storage mechanism for many other services, such as disk snapshots, Docker container registries and so on.

Buckets can be located in two kinds of locations, a region or a multiregion. (There is a third type, dual-region, for high-availability production applications, but it’s of no interest to us.)

  • A region roughly corresponds to a data center in a single location; regions are always located at least 100 miles (160 km) apart from each other. For example, the region us-west2 is located in Los Angeles, and asia-east1 in Changhua County, Taiwan. There are currently 24 regions worldwide.
  • A multiregion encompasses multiple regions, roughly corresponding to a continent. There are 3 multiregions: us, eu and asia. A few regions, for example australia-southeast1, do not belong to any multiregion.

Storage of data in a multiregion is just a little bit more expensive (at most 20% for the standard storage class), but has the advantage of free data flow into and out of any service in the same multiregion. (Other storage classes exist, which are charged less for storage at the cost of data read fees and a minimum data immutability promise: 30, 90 or 365 days, respectively. Some BurrMill rules automatically demote older versions of software builds into some of these classes before permanently purging them. We pretend it saves you money, typically by a whopping 20 cents a month, but the real reason is that it was another interesting toy to play with.) Latency of multiregional storage may also be higher. Generally, multiregional storage is the best choice, unless the amount of data is so large that the price difference (about $6/(TB×month) more in us than in us-west1, the largest difference of all) becomes a factor, or data is streamed directly from buckets into a pipeline. You may think the latter is a good idea, because it does not consume space on the cluster’s shared NFS storage, but there are other reasons, discussed later, that make it irrelevant.

Regional storage prices vary by location, but not significantly for the standard class (from $20 to $26 per TB×month), and multiregional prices are the same everywhere. In my practice, for the amounts of data we usually work with, charges for storage are too low to warrant shopping for the best location. The services provided are also exactly the same everywhere. Thus, we can discount GS locality when selecting the best location to run batch jobs. The situation is not this simple with GCE, as we’ll see next.
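
To make the location choice concrete, here’s a sketch of creating a standard-class bucket in the us multiregion and then verifying its location and storage class (the bucket name is a placeholder; remember that bucket names are globally unique):

$ gsutil mb -c standard -l us gs://software-myproject-322/
$ gsutil ls -L -b gs://software-myproject-322/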

Google Compute Engine

GCE is likely the most complex of all GCP services. Luckily, we do not need all of its features. Many of them are designed for online processing, such as Web sites: HTTP load balancers, autoscaling groups of identical machines that grow and shrink with demand, and many others. The type of work we do is purely a batch load, which can stop, wait and then continue without affecting the end result.

However, Kaldi training, like other HPC work, relies on support from specific hardware: we need GPUs and the fastest CPUs equipped with AVX-512 fused multiply-add (FMA) units. Google does not offer the full smorgasbord of hardware options at all of its locations, and this is why we must choose the location for our work wisely. In addition, prices also vary by location.

GCE has three levels of locality, and they are finer grained than those in GS:

  1. Global. This is as simple as it sounds: the thing is equally accessible everywhere. Examples of global resources are virtual networks, firewall rules and machine boot images. (It is notable that when you create a boot disk in a zone where its image has not been used before, it may take 20-30 seconds, but subsequent operations create the same disk from the same image in about 2 seconds. This time was surprisingly stable: I ran a script creating and deleting a disk 4 times a minute for 24 hours, and the numbers were so tight, without a single outlier, that I had no reason even to compute their dispersion.)
  2. Regions. These correspond to GS regions (exactly or nearly so). There aren’t many regional resources of use to us; regional scale is usually for large websites with the machinery to spread a service across zones. AFAIK, Google even requires customers to run in at least 2 zones to sign off on a service level agreement—i.e., the one under which Google pays penalties if a client could not run their load because of Google—but this is well above our pay grade. Examples of region names are us-west1 and asia-northeast3.
  3. Zone is a subunit of a region. No GS-like quirks here: every zone belongs to exactly one region, and zones are named systematically after their regions with a dash and a letter: us-west1-b and asia-northeast3-a are real zone names. Currently Google has at least three zones per region. Most of the resources we use are zonal: disks and VMs are prime examples.

There are a couple of objects that belong to other services, like GCE, but reflect the underlying GS hierarchy. For example, disk snapshots, a sophisticated incremental backup storage for disks, can be either regional or multiregional—just because of the way they are stored.

A peculiar, but quite expected fact is that hardware availability differs per zone. Some zones have a given type of GPU, and others in the same region do not. Assembling a Frankenstein cluster spanning a few zones is not allowed by the rules of our game: first, although communication latencies between zones in a region are in units of ms, often even below 1 ms, this is still much higher than in-zone networking, which runs at the very least as fast as a 10 Gbps LAN. Second, in-zone networking is free, but inter-zone traffic incurs charges. Therefore, when selecting the home for your cluster, consider only zones that have everything you need.
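
You can always see the current region and zone layout for yourself; for example (us-west1 below is just an illustration):

$ gcloud compute regions list
$ gcloud compute zones list --filter="region:us-west1"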

Let’s look at the options interesting to our Kaldi work that are offered in some location or another.

GPU

GPU is the most expensive hardware unit to rent. Also, GPU use dominates your bill for sizable models (from ≈50% for mini_librispeech to ≈90% on 6K hours of training data). Of all the GPU models offered, only 3 really fit the bill for Kaldi training. You are unlikely to need all 3, but using 2 for different jobs makes sense. All accelerators are NVidia Tesla, so I’ll refer to them by the model code. Quoted prices are for 1 hour of preemptible use, in US$, and do not vary by location. Also, when comparing, I am using only the actual train.py iteration timing (Kaldi uses 32-bit floats for training), not the numbers from the spec table.

  • T4. Turing TU104 GPU and GDDR6 memory, the same combination as the RTX 2080 (not Ti). The card’s design is optimized not for overall speed, but for the best FLOP/W ratio. It’s close in performance to the 1080Ti, only slightly slower overall. The benefit is its price, recently dropped to $0.11 per hour of preemptible use. Too bad our jobs do not parallelize well (my go-to solution when varying the number of parallel jobs in chain models is xent_output: start with decreasing it approximately as $n^{1/2}$ when increasing parallelism by $n$, then tune; YMMV), but for jobs up to 100 or maybe even 200 hours of speech, if you are not in a hurry, it’s a very economical choice. When requesting a quota, be aware that these are in high demand, given their bang for the buck. The availability seems to be improving over time.
  • P100. The flagship GPU of the Pascal series, GP100, with 16G of HBM2 memory. (Cf. the 1080Ti, which uses the very similar GP102 with GDDR5X.) Google uses the PCIe variant (not the NVLink-capable SXM). The HBM2 memory speed does make a noticeable difference. This has been my workhorse for larger loads. While it has the price tag of nearly 4×T4 ($0.43), it does not perform 4× faster; it’s in fact closer to 2× on larger loads. But don’t forget that you are also paying for CPUs and RAM, and larger loads require an upsized NFS server. So it’s hard to tell, really. Even if you are thrifty and the main parameter you optimize is cost, always consider the complete cost of everything. Your cheaper cost may vanish, and all you get in return is the same model at the same cost but trained twice as long. Maybe I’ll run a comparison of optimally configured clusters to compare these two accelerators on a 100-hour set, but I cannot promise to do that soon. Designing a comparison methodology is not simple. Of course, your contribution is welcome.
  • V100. This is an SXM2 board based on the Volta V100 GPU with HBM2 memory, same as the P100. (SXM2 is a form factor and pin-out that does not look like a PCIe board at all. SXM2 is a precondition for NVLink functionality: NVLink implies SXM2, but an SXM2 GPU does not necessarily support NVLink.) Given its NVLink Gen. 2 support, it can communicate very fast in a group of up to 8 of the same GPUs in the same server—but this is not what Kaldi is currently capable of. This is truly a powerhouse board, but its cost/benefit on Kaldi’s pure 32-bit IEEE floats comes out worse than the P100’s. It computes our loads at about 1.35× the speed of the P100, but its price of $0.74 is more than 1.7× that of the P100. Go for it only if time is of more concern than money.
  • P4, K80. You don’t want ’em.

GPU prices are the same in all locations, as has already been noted. The T4 is the most widely available accelerator, but for production work with larger models, start with the P100, and only then check whether the T4 is sensible, or, if you’re in such a hurry, whether the faster V100 is worth the expense.

Note that not all combinations of VM sizes and the number of GPUs per node are supported, and some VM types do not accept GPUs at all. Also, keep in mind that the virtual operation of “attaching” a GPU to the machine physically translates into GCP’s searching for an available physical host that has the physical GPU board plugged into it permanently. The tighter your constraints, the more likely your VM won’t start, or will get preempted too often.
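
To give you a feel for what “attaching” looks like in practice, here is a sketch of creating a single preemptible N1 node with one T4 in a zone that offers it. Every name and size here is a placeholder, and BurrMill’s tooling will normally create nodes for you; note the --maintenance-policy=TERMINATE required for GPU machines, and that you would still need to install the NVIDIA driver on such a node yourself:

$ gcloud compute instances create scratch-gpu-node \
    --zone=us-west1-b --machine-type=n1-standard-8 --preemptible \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --maintenance-policy=TERMINATE \
    --image-family=debian-10 --image-project=debian-cloud \
    --boot-disk-size=50GB --boot-disk-type=pd-ssd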

CPU

So you felt that choosing the right GPU was hard. Okay. Now hold on tight as we walk through the zoo of different CPUs available on the GCE. Rules of the game:

  1. All machine types can be configured as preemptible.
  2. Only N1 machines can use a GPU.
  3. One vCPU is a hyperthread, i.e. 1/2 of a real core. I usually allocate one job per physical core, not per hyperthread. GCE guarantees that vCPUs come in sequential pairs sharing a physical core: e.g., a 4-vCPU machine has 2 cores, with vCPUs 0 and 1 on the first core and 2 and 3 on the second. Also, in Intel-based machines with 64 or fewer vCPUs, all cores share one NUMA node and a virtual “socket.”
  4. Standard machine types are identified as type-configuration-numcpu, e.g. n1-highcpu-4. The word highcpu means low RAM per vCPU. The 3 ubiquitous configuration words here are highcpu, standard and highmem, in ascending order of their RAM/vCPU ratio, which is fixed within a type (but differs between types). You’ll spot other designators once in a while. numcpu=1 machines have only half a core available, and aren’t good candidates for a compute node.
  5. Low-power, fractional-vCPU shared-core machines do not have the numcpu suffix. These are f1-micro (¼ vCPU), g1-small (½ vCPU), e2-micro (¼ vCPU), e2-small (½ vCPU) and e2-medium (1 vCPU). All fractional-vCPU machines support short bursts up to a full hyperthread for a limited time. You can comfortably read logs and prepare experiments on a g1-small, e2-small or e2-medium machine, but they are not suitable for training or evaluation of models. My own machine of choice for low-power work had been g1-small before the E2 series arrived; now it’s the e2-medium, only a few percent more expensive, but twice as fast.
  6. You may build a custom configuration with as many vCPUs and as much RAM as you want, within what the machine type allows, but not all machine types may be custom-sized. C2 is a notable, and unfortunate, exception: these CPUs have the best performance for matrix operations.
  7. For some machine types (N1 is of the utmost concern), you also specify a minimum CPU platform. You need Skylake or better for N1 to take advantage of its AVX-512 FMA units on CPU nodes, but not on GPU nodes. Specifying the CPU platform does not change the machine cost, only the likelihood that your requested machine type won’t be available as preemptible, or will preempt too often. Request the lowest reasonable CPU for the job.

And the specific machine types:

  • N1. Historically the oldest machine type, and the most ubiquitous one. I’ve been using it the most, because when I started working on GCE Kaldi training, this type was basically the only one available. With the minimum CPU platform of Skylake, they performed well, but not nearly as fast as my Skylake-X desktop: about 50% of its speed. Split your steps into more nj in Kaldi; I used up to 500 vCPUs running at once for decode jobs. Also, some time ago Skylake was not available in every zone; currently it is in all 67 zones in the world.
    • N1 is the only type allowing a GPU.
    • When using N1 with the GPU, do not select the minimum CPU platform: this will not speed up computation, only make the node more preemption-prone.
  • E2. A new line of machines that may come with any type of CPU, outside of your control, but at a certain discount. They are the best choice for a small to large NFS server limited to 16 Gbps of network throughput (I’ll explain why in a technical post later). E2 machines are currently available in every zone, and come in predefined shapes only: all three variants listed under point 4 above, for 2 to 16 vCPUs, plus the shared-core ones listed under point 5.
  • M1-megamem-96 and M1-ultramem-{40,80,160}. These are essentially N1 machines, and were previously classified as N1; they may even be conjured up by their old name, s/m1/n1/. They got their own designator M1 only to emphasize their huge memory per vCPU: 15G for megamem, and 24G for the ultramem variety. While we do not use these yet, it is in the plans, for shuffling egs into cegs during nnet3 data preparation before training. No sensible NFS server can sustain the amount of disk churn during this process. And egs mixing is a two-stage process: first it creates temporary files, and then packs the cegs archives from them. And, since this is Google, why not store all the temporary files in a RAM disk? The m1-ultramem-80 has 1.88TB of RAM, which is 3 times the largest cegs set I’ve ever seen; and the next one up, the 160, has 3.75TB of RAM, if 1.88TB is too small a temp space for shuffling your egs—but you are unlikely to need such an amount of temp space (ping me if you do! I have a few more tricks up my sleeve for you!). The process would complete very quickly, which is likely to be both a money- and a time-saver. You’ll future-proof your rig if you select a zone where the m1-ultramem machines are available. They are not rare at all (most of the 67 current zones have them, and those that do are more likely to have other computing-intensive options).
  • N2. More recently introduced Cascade Lake machines, Xeon Gold series. About 15% more expensive compared to N1, but should be faster. We need to test that, but my hunch is that the mark-up is worth it. Just like N1, N2 is a no-frills general-use machine.
  • C2. These are the highest-end machines, based on Xeon Platinum 9282 hardware. The CPU cannot even be socketed; it comes in a BGA package to be soldered to the motherboard by the manufacturer. (I’ve read an unconfirmed report that Google is the largest server manufacturer in the US, beating Dell and HP, although they do not sell any; as usual, “Google declined to comment.”) I have not run a complete real-load test, but these must be 3 to 4 times faster than N1 on matrix computation loads. They cost only 25% more than N1, but there is a but: the C2 machines do not support custom configuration, and the standard ones have way too much RAM for Kaldi tasks. Still, these are likely to become the go-to machines for our computations. I have not done the comparison yet, but please compute and compare. Also, C2 machines are available in the fewest zones of all the types, although the gap is constantly shrinking.
  • N2D. These are the newest addition to the menagerie, based on AMD EPYC Rome CPUs. They come in a variety of shapes, with up to 224 vCPUs and nearly a TB of RAM, and support custom sizes. They do not have AVX-512 FMA units, so they aren’t of major interest to us as computing nodes. However, they are priced like N1, and are said to significantly outperform any other machine type on RAM-access-bound tasks. They look attractive enough for NFS server and small egs shuffler nodes, where a file cache in memory or a RAM disk is the king. I have not experimented with these yet, however.

Well, this is pretty much it. Once again: N1 for nodes with a GPU; N1, N2 or C2 for compute nodes; M1 or N2D for an egs mixer (in the future); and N1, N2D and E2 for general machines that are neither RAM- nor CPU-bound.

CPU prices vary by location, albeit insignificantly. Comparing only major regions (this is not an official term; I just call “major” the regions that offer nearly a full set of CPU and GPU species, as opposed to the usually more expensive “minor” ones, which apparently attract only a few large customers with specific needs in their geographic area; this distinction works for the US, but may be less applicable to the EU or Asia): us-* regions are the least expensive, eu-* the next (+10%), and asia-* even a bit more (+15%).
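
Before settling on a region, it’s worth checking which machine types and CPU platforms a candidate zone actually offers. A sketch, with us-west1-b as a stand-in for your zone of choice:

$ gcloud compute machine-types list --zones=us-west1-b --filter="name~'^(n2|c2)-'"
$ gcloud compute zones describe us-west1-b --format="yaml(availableCpuPlatforms)"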

Disks, OS Images, Snapshots

Touching this topic only briefly, as our goal is to select a computing location and get up and running quickly; everything else can be changed more easily later.

As with any other type of storage, GCE disks are chargeable from the moment they are created until they are destroyed. Disks are zonal (regional disks do exist, but they do not fit anywhere in our use scenario) and come in magnetic pd-standard and SSD pd-ssd varieties. The latter is about 4 times faster and about 4 times more expensive, and is the best fit for high-IOPS applications. There are a lot of peculiarities to the performance tuning of both disk species, some of which we’ll touch on later, and some you may research on your own if you wish. Price-wise, disks are the cheapest in the US and Asia, and about +10% in the EU. We boot from a 10GB pd-ssd, and the shared NFS volume is of the pd-standard type.
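
For the record, this is roughly what creating a data disk looks like from the command line (the name, size and zone are placeholders; BurrMill’s tooling creates and sizes the shared NFS disk for you, so treat this as an illustration only):

$ gcloud compute disks create scratch-data-disk \
    --zone=us-west1-b --type=pd-standard --size=1500GB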

Networking: speed and price

There are two terms that you might be unfamiliar with: ingress and egress. They simply mean the volume of incoming and outgoing network traffic (total number of GB within a billing cycle), but are used mostly in contexts where the traffic is measured as a billable commodity.

All major cloud services charge nothing for ingress (e.g., when you upload datasets to buckets). On the opposite side, traffic from the cloud to the Internet is always the most expensive, and the price depends on where you are located (USD $0.12 to $0.23 per GB; a new pricing model comes into effect soon, but as of now there are too many empty cells in its table), not on where you run the computation. This does not affect us Kaldiers much; even if you download models, which hardly ever come close to 200MB in size, that is still below 5 cents per download at the most expensive rate.

Internal traffic within GCP can vary from free to the full price cited above, and is free or nearly free if all moving parts have been set up correctly. As a general rule, traffic between machines inside a zone, and from a zone to its containing region and the other way round, is free. In our admin tools, we are putting safeguards in place to prevent a novice user from making expensive mistakes: the default firewall rules we create for a cluster allow communication between machines only within that cluster.

I just ran a couple of objective and subjective tests of networking speeds from home in Los Angeles to different locations. The objective part was uploading and downloading 1000 incompressible files of random data with lengths sampled from a normal distribution with $\mu=524$, $\sigma=100$, and then another single random file of the same total length. The subjective part is how responsive Emacs was with the X server on my own machine (and the X protocol is very sensitive to latency). All times are in seconds; the smaller, the better. Note that in all cases I used the Premium network tier, so the “last mile” connection entered Google’s network at the same, nearest point-of-presence (POP).

There is no magic here. Google’s network is fast, but not infinitely fast, and the lags are inevitable. If you are a vim user, you’d probably notice little lag even if you compute on the other side of the globe. X programs, however, offer a much better experience if run close to home. I’m not discouraging you, but test whether this is an issue for you or not. There are indeed ways to remedy a slow connection (the Emacs tramp package, for example, which loads and saves remote files from your locally running Emacs).
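
If you do want to gauge the X latency for yourself, a quick sketch (the instance name is a placeholder; the arguments after -- go straight to ssh, where -X enables X11 forwarding and -C turns on compression, which helps a little on slow links):

$ gcloud compute ssh login-node --zone=us-west1-b -- -X -C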

So, where should I compute?

If you are in North America, compute probably in the US: the regions us-east1 and us-central1 have everything. This is simple. If not, remember that for a large model your bill is dominated by GPU charges. Balance the convenience of a fast connection against cost. The GPU is the biggest wallet crusher, and its price is the same everywhere.

Always request quota in two regions: this lets you compare the preemption churn. The largest and the most popular region, AFAIK, is us-central1. However, this was also where I observed significantly more preemption events than in us-west1, which I settled in: that was before the C2 machines had been introduced, so there was no difference in CPU variety, and I selected the zone closest to home that had everything I wanted to test.

The excessive churn is what eats into your preemptible machine savings. Suppose you are training, and train.py has already ramped up to 18 GPU nodes. One of these nodes suddenly dies. The rest of the nodes will have completed their tasks and sit idle while Slurm spins up a replacement and that replacement finishes computing its shard. This may be a loss of 3 minutes on 17 fully booted and idle GPU nodes, or nearly an hour in GPU costs. The good news is that preemptible GPUs are inexpensive; if you lost 1 GPU×hr, that’s an extra $0.43 in expenses. Peanuts.

Preemption events are just part of GCE life; you cannot prevent them unless paying the full price for the GPUs, but that’s so expensive that you do not want it anyway. My large model trained for 48 hours even with nj_initial = 3 and nj_final = 18. There were 33 node losses during that time; and, of course, not all of them caused 17 GPUs idling as in the example above; in the beginning, there were fewer GPU nodes running. Still a bargain at $0.43 per GPU×hr compared to the full cost of $1.46 (and non-preemptible machines also have non-preemptible prices on their CPU and RAM, too).

Be aware that sometimes you cannot get anything running sensibly at all. Nodes start but disappear in a minute or two (it’s worth noting that you are not billed for a preemptible machine if it’s preempted during the first 60 seconds of its life), or just fail to start at all. Also, when you work with a cluster for a while, you start to feel the pattern of its preemptible resource availability. The us-west1 region gets a rush of load around midnight. People love round numbers, and this is probably when most businesses start their nightly pipelines. Paradoxically, if you have a 12-hour computation, 1000 hours is, IMO, the best starting time. But these patterns are shifting; many are seasonal, others change just because. If you have a 48-hour job, well, it does not matter when you start it. If a lightbulb in your head just lit up with the word “weekend,” congratulations: you are not alone. Sometimes the preemption rate is even worse on the weekend than during the work week. YMMV, of course, depending on the region, zone, luck and the arrangement of stars. (This is a common pattern in the US, a land populated by weirdos who consider working on the weekend a chutzpah, regard said חוצפּה (khutzpah) a praiseworthy ambition, deem 10 days of vacation per year a luxury, and find typing this very comment at 0200 hours on a Sunday morning nothing out of the ordinary. I have no idea how busy GCE is in my native Europe, where restaurants and supermarkets close for the weekend, but it’s certainly worth checking out!)

BurrMill currently has no direct support for running multiple clusters on different continents. We are planning to add this feature in beta 0.7, if there is enough interest.

I prepared a table of available computing resources around the world. It is current as of this moment, March 2020. I may update it once in a while; it’s probably helpful.

The structure is very simple. The zone column shows all zones that offer at least one GPU type we can use. Horizontal lines divide the table into regions. The next three columns indicate availability of the three types of GPU of interest; a non-empty symbol at the intersection means that these are present in the zone (I am using the word present, not available, since sometimes, although rarely, they have no available quota). The last three columns are more heterogeneous. C2 is self-explanatory; UM stands for the m1-ultramem-n class (all of the 40, 80 and 160 vCPU sizes always go together—I’ve never seen otherwise). The last column marks the presence of the new N2D type. The “x”s are set in smaller type, because it’s a non-decisive feature, just in case you want to play with this new machine type. Similarly, V100s are marked with a small-type “$” sign standing for big $$$, since they are not an economical choice—I’m listing them in case your deadline is looming yesterday. By the way, if your deadlines have that inexplicably persistent propensity to loom yesterday, you must request a quota for this accelerator in advance, just like for any other.

One in 20 humans is at least partially colorblind. If you cannot see the highlighted horizontal strips in the table, please ping me in the comments with an idea of how to improve the readability of the table—best of all, share a Google Sheet with an example highlighting you find readable. A Google Sheet “published for the web” is not traceable to the owning account, if your concern is anonymity.

The highlighted horizontal strips display selections of possible zones with the C2 and UM machines and at least one of the T4 or P100 GPUs present. Quotas are allocated per region, not per zone. If you want to experiment with different accelerator and/or CPU types, find a region, or better two, that suits you best.
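
The table will inevitably go stale, so double-check the zones you shortlisted against the live data; for example, to see which accelerators are present in a region, or where in the world a given model is offered (the region in the first filter is a placeholder):

$ gcloud compute accelerator-types list --filter="zone:us-east1"
$ gcloud compute accelerator-types list --filter="name=nvidia-tesla-t4"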

Note that moving your large shared NFS disk within a continent (a multiregion, essentially) takes time, 30-40 minutes altogether, but does not incur any noticeable charges. If, for example, you start your work in europe-west4-b using the T4 GPUs, and later want more oomph from the P100 accelerators, you may move the disk by snapshotting it into the eu multiregion and resurrecting it in europe-west1-d. Both of these actions are free of network charges (you will be charged for the snapshot storage, but it’s a small charge, and you may delete the snapshot right away). Moving into the us will, however, incur charges at the intercontinental rate: transferring a 600GB snapshot (the size of my 2.5TB disk’s snapshot, with experiments on nearly 5K hours of speech and most egs not cleaned; snapshots take only the used portion of the disk) will cost you 600×$0.08=$48. Recreating features or egs will also cost time and money. Plan to stay within a continent for a single series of experiments.
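
For reference, the move just described boils down to two commands plus a cleanup (the disk and snapshot names are placeholders; I believe --storage-location accepts a multiregion name such as eu, but verify against the current docs before relying on it):

$ gcloud compute disks snapshot nfs-data --zone=europe-west4-b \
    --snapshot-names=nfs-data-move --storage-location=eu
$ gcloud compute disks create nfs-data --zone=europe-west1-d \
    --source-snapshot=nfs-data-move --type=pd-standard
$ gcloud compute snapshots delete nfs-data-move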

Homework time!

Before proceeding with the next section,

  1. Choose one, or maybe two, regions which you find convenient for your computation, with zones containing your desired GPU and CPU, very preferably on the same continent. Currently, we do not even have tool support for multi-multiregion computing, and when we do, it will be limited to deploying new clusters, due to the high costs involved in moving the large shared NFS disks, as was just explained above.
    Example: you decided on C2-CPU based machines, and want to compare P100 and T4 GPUs, or use them for different loads. Looking at the resource table you find that there is only one region in the world that has both GPU types and C2 CPUs in the same zone, namely the zones us-east1-{a,d}. us-central1 also has both, but in different zones. In other parts of the world, you’ll probably have to get an allocation in two regions, such as europe-west1 and europe-west4, or asia-northeast1 and asia-east1.
  2. Hopefully you chose regions in a single multiregion; write down that multiregion’s name. Do not use two, at least for now. If you cannot get quota for everything you want in one, ping us for assistance. You’ll be the beta tester of that new feature.
  3. Come up with a string identifier of your project, considering that
    • GCP can generate one for you, but these are relatively long and hard to remember due to a 6-digit random suffix.
    • It must be 6 to 30 characters long, contain only lowercase ASCII letters, digits and hyphens, start with a letter, and end with a letter or digit.
    • You will have to type it once in a while.
    • It should be not very easily guessable; keep it semi-private.
    • It cannot be changed.

Examples of bad names are burrmill or my-burrmill (you may be lucky if they are not taken, but they are easily guessable). An example of a better name is ohmycoffee-322 (easy to remember, easy to type, not guessable). You may also simplify the name generated by GCP for the autocreated project you’ll be deleting when it comes to it; inspired by that real example, upbeat-bolt-271409, you may come up with upbeat-bolt-409 or even upbolt409. In short, something not very long, so you’ll remember it and can type it quickly, and not very guessable. I like names that make sense, but that is not a requirement.
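
Once you have settled on a name, creating the project is a one-liner; using the made-up example name from above (the next post walks through the full setup, so this only illustrates that the ID is fixed at creation time):

$ gcloud projects create ohmycoffee-322
$ gcloud config set project ohmycoffee-322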


In the next, third post of this series, we will walk you through setting up GCP and the command line tools. Make sure to make the above selections; we’ll use them in the fourth post.
