Supercharge SSH

In this part we’ll go through the IAP tunneling feature of GCP, OS Login and its key management, and a few SSH setup ideas that will make connecting to the cluster easier.

IAP Tunneling

None of the VMs on your network are accessible from the Internet: these are the default firewall settings, and we did not open any ports to the whole world. To access machines inside, GCP offers Identity-Aware Proxy (IAP) tunneling. Connections are established from a narrow range of IP addresses, guarded by Google, so you do not need the commonly used “bastion hosts” to protect your network. We created a firewall rule to allow this IP range to connect to the SSH port 22 on any instance on the network. As long as you are authenticated with the Cloud SDK, you, as the owner of your project, are automatically granted access to the tunneling feature. If you add other accounts but do not grant them the Project Owner role, you’ll need to grant them access to tunneling, but this goes beyond our current walkthrough.
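
For reference, the rule BurrMill created for you is roughly equivalent to the following command (the rule and network names here are illustrative, not necessarily the ones used in your project); 35.235.240.0/20 is the fixed range of addresses that IAP connects from:

$ gcloud compute firewall-rules create allow-ssh-from-iap \
    --network=NETWORK_NAME --direction=INGRESS \
    --source-ranges=35.235.240.0/20 --allow=tcp:22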

If you peeked at the documentation page linked to above and wonder which of the two options we use: the third. It’s not documented, but is not likely to go away: gcloud compute ssh uses this command internally, and it has to persist in one form or another. gcloud compute ssh is convenient but has its own drawbacks. For one, it may create an SSH key for you. I do not know about you, but I am very meticulous when it comes to managing the SSH keys kept on all machines I use, and for this task I group the keys and other files together. Another is that you need to pass the command the zone name (you can have the default zone set in the local configuration on any given machine), but I wanted something more flexible.
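
If you are curious, the tunnel under the hood boils down to something like the command below. This is only a sketch: the --listen-on-stdin flag is the undocumented part, the instance name and zone are placeholders, and the exact invocation may vary between SDK versions.

$ gcloud compute start-iap-tunnel INSTANCE_NAME 22 --zone=ZONE --listen-on-stdin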

We have a supplementary script, burrmill-ssh-ProxyCommand, which attempts to locate the host name you pass on the ssh command line from multiple sources. Without going into details (you are welcome to read its source), this utility allows you to connect to any uniquely named VM in the project by simply giving its name. (The full unique name of an instance also includes the zone, so you can have two instances with the same name in two different zones.)

At this point, if you’ve used SSH for a while, you should have noticed that one critical component is missing from the flow. The tunnel allows your SSH client to connect to the server, but you need a way to authenticate to it, and the most common way to authenticate to an SSH server is registering the public half of a key pair with the server. But in a dynamic environment, where machines may temporarily pop out of nothing, or be rebuilt on demand, this is not as easy as adding your key to the ~/.ssh/authorized_keys file. Sort of a catch 22 (referring to the SSH port number, of course): to add the key you need to access the machine first, but…

Be in control of OS Login keys

…in fact you do not. If you were to use gcloud compute ssh, it would create the key and register it for you. But as long as you want to keep your keys in a neat order, and know which to remove and which to keep, manual key management is the best option.

The OS Login SSH keys do not offer any additional security over IAP: anyone in possession of your Cloud SDK credentials can add their own key and connect to your instances. It does not make sense to password-protect these keys, even if you normally encrypt your SSH keys.

The golden rule of SSH key management is that the private key never leaves the machine it was created on. Sometimes people copy a private key (a highly sensitive one, too; for example, a GitHub key) across machines out of convenience, but if one of these machines is lost, you’ll have to revoke the single key, which pays you back with the massive inconvenience of creating a new one for every computer you work on. And if none of the machines has physically disappeared, but you see signs that someone is using the key, you had better know which machine the key has leaked from. Always maintain one key per machine, be it for GitHub, for BurrMill or anything else.

To be able to establish the IAP tunnel, you need the Cloud SDK installed, so that the gcloud command is available and authenticated.
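
If you have not done this yet, the setup amounts to roughly the following (PROJECT_ID is a placeholder for your own project ID):

$ gcloud auth login
$ gcloud config set project PROJECT_ID
$ gcloud auth list    # Verify that your account is listed as active.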

OS Login is a GCE service that keeps your public keys with the GCE project, together with your Unix account information, including login name and UID/GID. When you attempt to connect to a node with SSH and your private key, sshd, the SSH daemon, makes a call to the OS Login API that returns your public key, so that you never need to copy it to the machines explicitly. The OS Login profile is initially empty:

$ gcloud compute os-login describe-profile
name: '116648612595226578317'

The full profile is created when you add a key to it. On the local machine that you will use to log in to your cluster, run ssh-keygen to generate a new keypair, and add its public key to the profile. Identify the key with a comment so that you know which key to revoke when it comes to it. Adding details such as the key purpose and date is also helpful: your keys will persist for years. (NIST currently recommends changing passwords only when a compromise is suspected; the same recommendation can sensibly be applied to any kind of credential.) In the example below, the machine is called kiki, because it’s a cool name, but make sure to substitute your own machine name in the key comment. I prefer ED25519 keys because they are short and do not obscure the output of gcloud compute os-login describe-profile. -f supplies the base name of the key file, and -N '' is for an empty key password.

$ cd ~/.ssh
$ ssh-keygen -f burrmill -t ed25519 -N '' -C 'IAP access from kiki 20-03-31'
Generating public/private ed25519 key pair.
Your identification has been saved in burrmill.
Your public key has been saved in burrmill.pub.
The key fingerprint is:
SHA256:U5D+xO7vn454iyp3DyY2Z0YxduUY9OiyNfSrQ/Ox2zc IAP access from kiki 20-03-31
The key's randomart image is:
+--[ED25519 256]--+
|        .. .o .  |
|        ..   B   |
|       . .= = o  |
|        .oo* .   |
|        S+o + .  |
|         oo+o... |
|        +.B. o.o |
|      ...Bo+oooE.|
|       o.oo=B==.+|
+----[SHA256]-----+
$ ls -lgG burrmill*
-rw------- 1 419 Mar 31 14:10 burrmill
-rw-r--r-- 1 111 Mar 31 14:10 burrmill.pub
$ cat burrmill.pub
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMGP5QpApLZMZBqYpg1sGuavbgPvkCnapwGef0vd0TFj IAP access from kiki 20-03-31

The last cat command shows you what the key looks like. The first two tokens are the key type and the encoded key data, respectively, and the rest of the line is the comment. Now you can add it to the OS Login profile, and the profile is immediately populated:

$ gcloud compute os-login ssh-keys add --key-file=burrmill.pub
loginProfile:
  name: '116648612595226578317'
  posixAccounts:
  - accountId: solo-mill-12pc
    gid: '1051760918'
    homeDirectory: /home/han_solo_gmail_com
    name: users/han.solo@gmail.com/projects/solo-mill-12pc
    operatingSystemType: LINUX
    primary: true
    uid: '1051760918'
    username: han_solo_gmail_com
  sshPublicKeys:
    5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877:
      fingerprint: 5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877
      key: |
        ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMGP5QpApLZMZBqYpg1sGuavbgPvkCnapwGef0vd0TFj IAP access from kiki 20-03-31
      name: users/han.solo@gmail.com/sshPublicKeys/5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877

You cannot change the username, and it’s long, not something you want to type each time you connect! Make a note of it, as you’ll need it to configure SSH next.
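
If you prefer not to scan the YAML by eye, a gcloud format projection along these lines should print just the username (a sketch; adjust the index if your profile lists more than one POSIX account):

$ gcloud compute os-login describe-profile --format='value(posixAccounts[0].username)'
han_solo_gmail_com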

You add keys from other machines the first time you want to connect from them. Preferably, do not copy the same key to multiple computers. Delete keys from the profile when they are no longer needed. The example below shows how to do that (you can re-add the key afterwards; this is what we in fact do, so you can run these commands to practice):

# Find the key you want to delete. This is where you need the comment!
$ gcloud compute os-login describe-profile
name: '116648612595226578317'
posixAccounts:
- accountId: solo-mill-12pc
  gid: '1051760918'
  homeDirectory: /home/han_solo_gmail_com
  name: users/han.solo@gmail.com/projects/solo-mill-12pc
  operatingSystemType: LINUX
  primary: true
  uid: '1051760918'
  username: han_solo_gmail_com
sshPublicKeys:
  5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877:
    fingerprint: 5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877
    key: |
      ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIMGP5QpApLZMZBqYpg1sGuavbgPvkCnapwGef0vd0TFj IAP access from kiki 20-03-31
    name: users/han.solo@gmail.com/sshPublicKeys/5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877
$ gcloud compute os-login ssh-keys remove --key=5cc9770f33450a9c9c9b48f1bb5717dd59858e21427115e242327c0e3493c877
# Your profile has no keys now:
$ gcloud compute os-login describe-profile 
name: '116648612595226578317'
posixAccounts:
- accountId: solo-mill-12pc
  gid: '1051760918'
  homeDirectory: /home/han_solo_gmail_com
  name: users/han.solo@gmail.com/projects/solo-mill-12pc
  operatingSystemType: LINUX
  primary: true
  uid: '1051760918'
  username: han_solo_gmail_com
$ gcloud compute os-login ssh-keys add --key-file=burrmill.pub
  . . .

If you have more than one GCP project, you must re-add at least one key to create a profile for each additional project. This seems kind of illogical (and probably is), because SSH keys belong to your account, not a project, but this is the only way to create an entry for a new project in the posixAccounts array.
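
For example, adding the same public key under a second project might look like this (the project ID is a placeholder; --project is the standard global gcloud flag):

$ gcloud compute os-login ssh-keys add --key-file=burrmill.pub --project=OTHER_PROJECT_ID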

Configure SSH to connect to the cluster

In this section, change your current directory to ~/.ssh/.

Feel free to skim any section if you are already familiar with these SSH tricks; you’ll immediately recognize those that you are already using.

SSH configuration for the Jedi initiate

Is it spring cleaning time for your SSH configuration? Do you even have one? I want to share bits of my SSH-fu; you do not have to subscribe to my practices, but I have tweaked these patterns for years, and believe they do make my work efficient overall. I rarely pass switches to SSH. I do not normally pass the username to it. My Git remote URLs are uniform, like [ssh|https|git]://github.com/burrmill/burrmill.git, which helps when switching protocols. All these tweaks live in a single config file, ~/.ssh/config. If you do not have it, create it.

Let’s look at the first few lines of my ~/.ssh/config file:

Host github.com
  IdentityFile ~/.ssh/github.com
  User git
  UserKnownHostsFile ~/.ssh/github.com.knownhosts

Host gitlab.com
  IdentityFile ~/.ssh/gitlab.com
  User git
  UserKnownHostsFile ~/.ssh/gitlab.com.knownhosts

Host gerrit.asterisk.org
  Port 29418
  User kkm
  IdentityFile ~/.ssh/gerrit.asterisk.org

Host kiki 10.200.0.11
  HostName 10.200.0.11
  User kkm
  IdentityFile ~/.ssh/kiki
  SendEnv GIT_* LANG LC_* LESS MANOPT MANWIDTH TCP_X_DISPLAY TIME_STYLE

What is going on here? First, indents are unimportant, but are usually added for readability. SSH has two conditional clauses, Host and Match. Everything between two conditional clauses applies to the one above it; we’ll use the word block for a conditional and all the options that follow it and are applied when it matches. Everything before the first conditional in the file applies to all hosts, and cannot be changed by later statements. If you want to set defaults for values not otherwise set, do it at the end of the file, after a Host * conditional, which matches any host.

Remember these two precedence rules:

  1. The SSH client reads the config file top to bottom. As soon as a conditional clause matches, all options under it receive their final values. Subsequent matches do not change these options. The assignment is tracked per individual option: options not set by a previous successful match can still be set by subsequent matching blocks. No option is ever assigned twice (see the illustration after this list).
  2. Your ~/.ssh/config is read before the system-wide /etc/ssh/ssh_config, so your file takes precedence.
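
As a quick illustration of rule 1, consider this hypothetical fragment (the host and user names are made up):

Host kiki
  User kkm

Host *
  User nobody
  IdentityFile ~/.ssh/default

ssh kiki connects as the user kkm: the first block set User, and the later Host * block cannot override it. IdentityFile, however, was not set by the first block, so the connection to kiki still picks up ~/.ssh/default from the catch-all block.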

The first three blocks match Git remotes. IdentityFile points to the private key used for the matching host, and User sets the username if it is not provided on the command line. Let me illustrate this; focus on the first block.

$ ssh github.com
PTY allocation request failed on channel 0
Hi kkm000! You've successfully authenticated, but GitHub does not provide shell access.
Connection to github.com closed.
$ ssh kkm@github.com
kkm@github.com: Permission denied (publickey).

The second command overrode the username to kkm, and this user either does not exist, or my private key does not match theirs. The first command, without the user@ part, defaults to the username git from the config file, so this is an exact equivalent of ssh git@github.com, the recommended way to see if your public key is registered properly.

UserKnownHostsFile is useful in two cases. One is for hosts that may change their identity key often, as has happened with certain remote Git servers. If you use the default ~/.ssh/known_hosts file, it becomes polluted with old keys, or you end up with ssh yelling at you and refusing to connect. If a server asks you to accept its host key often, it’s better to redirect these keys to a separate known hosts file, which you may delete once in a while. The fact that they flunk at managing their shared identity is irrelevant: this is what they do, and a separate host identity file is a good way to cope with it. The second use case is tighter identity management. We’ll get to that in a moment.

The last block is for a machine on my home network. Note that the Host conditional has two patterns here, separated by a space, kiki and 10.200.0.11; a clause matches when any of its patterns does. The HostName option tells SSH which host to connect to (by name or, as in this case, by IP address). The host has no DNS record or any other means of resolving the name, so I cannot ping kiki: ping reports that the host name cannot be resolved into an IP address. But ssh kiki does connect, because it uses the IP address set by the HostName option.

Configure SSH client like a padawan

Now let’s set up the configuration for your GCE hosts. We’ll start with a bare minimum, and add parameters in a few iterations. It is important that you understand what we are doing. All parameters are described on the ssh_config man page.

# Common GCE defaults.
Host x?-* q?-* boot-* target-* instance-*
  User han_solo_gmail_com
  IdentityFile ~/.ssh/burrmill
  HostKeyAlias burrmill_all_hosts
  ProxyCommand burrmill-ssh-ProxyCommand %h %p

# Deployed BurrMill clusters with the strict known key.
Host x?-* q?-*
  UserKnownHostsFile ~/.ssh/burrmill.knownhosts
  StrictHostKeyChecking yes

# For temporary GCE machines with random keys. Send them to dumpster.
Host boot-* target-* instance-*
  UserKnownHostsFile ~/.ssh/burrmill.unknown_hosts
  StrictHostKeyChecking no
  LocalCommand /bin/rm -f ~/.ssh/burrmill.unknown_hosts
  PermitLocalCommand yes

Instead of x?-* q?-*, use the naming patterns of your clusters. Note the three other patterns, boot-* target-* instance-*. The first two match the names of machines in the Daisy workflows that build the OS image and the CNS disk, which you’ve already run. I use them for debugging; you hardly need them. The instance-* pattern matches the default name of a machine created through the Cloud Console Web interface, which looks like instance-1, instance-2 and so on. When I launch a one-off instance to test something, this comes in handy. You may want to keep this one.

The common defaults are set in the first block. The HostKeyAlias option replaces the hostname for the purpose of adding or checking the key in the known hosts file. Normally, absent this option, the hostname is used, but in our case this won’t work: all machines in any cluster share the same host identity key, global to the GCP project. The format of the known hosts file is one host identity per line, and we cannot add all hosts in advance, because you will connect to brand-new clusters, possibly even to a compute node, and these may number close to a hundred. With the host key alias you add just one key instead, under the arbitrary alias specified by this option, and all matching hosts will be checked against this hostname surrogate. Your file, which we’ll create in the next step, should normally contain just one line with the common public key shared by BurrMill hosts. The keypair has already been generated by bm-update-project. Here’s how this file usually looks:

$ cat burrmill.knownhosts
burrmill_all_hosts ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBF+KUJghi6NgvbFkPgieCBQPQZMxF6NJ3G2pYfMi22G

Another directive, ProxyCommand, tells SSH to run this command and pipe all communication through its stdin and stdout. This is where the IAP tunnel is established, and all SSH communication flows inside this tunnel. %h %p tells SSH to substitute the hostname and port, respectively, from the command line (or the default port). The proxy is a very simple concept: instead of communicating with the host over TCP, SSH communicates with the proxy program in exactly the same way as if it were a direct TCP connection.
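
For example, assuming a cluster whose node names match the x?-* pattern, say with a login node xc-login, the command

$ ssh xc-login

makes SSH spawn burrmill-ssh-ProxyCommand xc-login 22 and exchange all protocol traffic through that process, instead of opening a TCP connection itself.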

The burrmill-ssh-ProxyCommand script is located in the ./bin/ directory of BurrMill, and is normally on your PATH.

In the second block, we point to our nearly immutable known hosts file, tell SSH not to update it, and, with the StrictHostKeyChecking yes option, make it refuse to connect if the key does not match. This is a minor security measure, and it also tells you whether your matching patterns are correct: SSH refuses to connect with quite profuse diagnostics, so you’ll immediately recognize if the configuration is wrong.

The third block instead relaxes host checks. The one-off machines will each have a new, randomly generated key which cannot be validated. SSH does not have a configuration option for this exact behavior, so, as a workaround, we tell it to save the key without asking into a file with the special name burrmill.unknown_hosts (do not miss the unknown part!), and then delete this file right away as soon as the connection is fully established. PermitLocalCommand yes is required to allow the rm command to be executed; without it, LocalCommand is silently ignored. Deleting the file is important; otherwise SSH may refuse to connect with the error “instance-1 host identity changed,” since instance-1 is the most common name for those one-off test instances, and they are created with a different host identity key each time.

Copy the BurrMill cluster identity key

The public part of the generated key is stored in the Software bucket of your project. Either use the Tab key to autocomplete the name after gs://softw, or list your buckets with the gsutil ls command to find its full name.

# The printf command creates the file without a final newline.
$ printf 'burrmill_all_hosts ' > burrmill.knownhosts
$ gsutil cat gs://software-PROJECT_NAME-SUFFIX/hostkey/ssh_host_ed25519_key.pub >> burrmill.knownhosts
$ cat burrmill.knownhosts
burrmill_all_hosts ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBF+KUJghi6NgvbFkPgieCBQPQZMxF6NJ3G2pYfMi22G COMMENT

Power up your SSH config like a Jedi Knight

The configuration so far will already let you connect to any cluster node (you normally connect to the login node), but what if you need to hop to a different node from there, to check logs or CPU usage, or to copy files inside the cluster using scp? Your secret key is on your local machine, not available on the login node, so you won’t be able to authenticate to any other node: they know your public key from the OS Login service, but you no longer have its private half! The solution is to make the private key temporarily travel with you without being saved on disk. The SSH Agent is a program that keeps your key in memory as long as you are logged on to a node far, far away.

First of all, you need the SSH Agent running on your machine. Normally, in a graphical X Windows session, it already is. If, however, you start from a headless machine, the agent may be unavailable. If you both use local X sessions and remote into the same machine, the Agent may become split: one instance has the key, but another, in your remote ssh session, has none. Check whether you have the SSH_AUTH_SOCK variable in your environment (env | grep SSH_). Some distros run the GPG Agent in SSH Agent emulation mode, but it does not cut it when you have multiple SSH sessions to your headless machine (the one where the private key is stored). If you do not have this variable, and your system is based on systemd (type systemctl status --user to quickly figure out), we have a solution for you (a copy is also in ./tools/); download the file and follow the detailed instructions in the comments. If not, read man ssh-agent for how to run it on your system in the “classic” daemon manner. The main advantage of the service in the Gist is that it shares a single instance of the Agent across all your sessions. This is not easy to achieve without systemd (and even with it, it still requires a hack in the .bashrc to retrieve the socket!).
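
As a quick sanity check, and as a fallback if you go the “classic” route, something along these lines should work in any shell (a sketch; the eval line starts a throwaway agent for the current shell only, which is exactly the limitation the systemd service avoids):

$ ssh-add -l                  # Lists keys held by the agent, or complains that no agent is reachable.
$ eval "$(ssh-agent -s)"      # Start an agent for this shell session if none is running.
$ ssh-add ~/.ssh/burrmill     # Load the key by hand; the AddKeysToAgent option below makes this automatic.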

If you cannot get the SSH Agent working, it’s still okay (but please either ask in the Q&A forum, or open a bug if you think it is one), but you won’t be able to hop between SSH connections. If you are able to run it, add these options to the “cluster” block (the second of the three):

# Deployed BurrMill clusters with the strict known key.
Host x?-* q?-*
  UserKnownHostsFile ~/.ssh/burrmill.knownhosts
  StrictHostKeyChecking yes
  AddKeysToAgent yes
  ForwardAgent yes

Generally, forward the agent only to hosts that you completely trust. The host gets access to your private key, and can connect on your behalf to anything you can. Think of it as asking “can you please hold my wallet for a while? I’ll be back in 15 minutes.” Quite a significant level of trust is implied by such a request.

The first option tells SSH to put the key it used to authenticate to the host into the in-memory Agent store. This is a safe option. The second one tells SSH to allow the remote host to request that key. This is what happens when you are logged on to qe-login and run ssh qe-node-std-12: the SSH client on qe-login pulls your private key from the agent on your machine and uses it to authenticate to qe-node-std-12. The OS image in BurrMill has all the necessary presets for that. Of course, you should restrict agent forwarding to hosts (such as qe-login in this example) that you control and trust.
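
To verify that forwarding works, ask the agent for its keys from the remote side; the host name follows the qe- example above, and the exact output line will differ on your setup:

$ ssh qe-login
# ... now on qe-login:
$ ssh-add -l
256 SHA256:U5D+xO7vn454iyp3DyY2Z0YxduUY9OiyNfSrQ/Ox2zc IAP access from kiki 20-03-31 (ED25519)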

SSH opportunistic connection sharing for Jedi Masters

With each of the power-ups you trade security for convenience. You can fully trust the BurrMill environment because you control it, so this is justified. But use your newly gained Force wisely in other environments.

One of the inconveniences of the IAP tunnel is that it takes 2-3 seconds to establish (and our proxy command adds an extra couple of seconds, too; we are working on optimizing it). When you connect to the same host from another terminal window, a new tunnel needs to be established. You can tell SSH to reuse the existing connection and multiplex all communication with the remote host within a single SSH connection, and, consequently, reuse the tunnel. The mode is called opportunistic because SSH first attempts to reuse a connection if one exists, and establishes a new one if not. Even after you disconnect, the multiplexing socket stays alive for a while, so the next ssh same-host command will connect you instantly, as long as the multiplexing SSH client daemon and the socket stay alive.

You can close the multiplexing socket at any time with ssh -O exit host-name.
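
The same -O control switch can also tell you whether a master connection is currently alive; the host name here is illustrative, and the pid in the answer will of course be your own:

$ ssh -O check xc-login
Master running (pid=12345)
$ ssh -O exit xc-login
Exit request sent.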

The last three options in the block below are the ones that control multiplexing.

# Common GCE defaults.
Host x?-* q?-* boot-* target-* instance-*
  User han_solo_gmail_com
  IdentityFile ~/.ssh/burrmill
  HostKeyAlias burrmill_all_hosts
  ProxyCommand burrmill-ssh-ProxyCommand %h %p
  ControlMaster auto
  ControlPersist 43200
  ControlPath ~/.ssh/sockets/%C

ControlMaster auto enables the opportunistic mode. ControlPersist determines the lifetime of the multiplexing socket, in seconds. Choose the timeout sensibly. From the safety of your home, you can hold it for a while (like the 12 hours set above). If you’ve planted yourself in a coffee shop with your notebook, do not let it run for too long; 10-15 minutes is more than enough if you disconnect temporarily but want to reconnect. Do not set this option to 0 for an indefinite timeout: you’ll forget about it after a couple of weeks, and, unless you log off or reboot the machine, will just be wasting resources and keeping a security loophole open.

ControlPath points to the socket filename. SSH silently ignores the multiplexing setup if the directory does not exist; also, there is no default for this option, so it must be set. If you have set up the systemd service from the Gist in the previous step, it will have created the directory with the correct permissions (either under ~/.ssh/ or as a link to an in-memory filesystem under /run; the $XDG_RUNTIME_DIR variable points to this small per-user tmpfs mount). If you do not see the ~/.ssh/sockets/ directory, create it with access permissions for yourself only:

$ mkdir sockets
$ chmod 700 sockets

Unlike private key files, the ~/.ssh/config file can, and probably should, freely travel between the machines you are normally working on: treat it as any other dot-file.

Now cd back to the BurrMill root directory.


Your SSH is now set up to connect to the cluster in the most unobtrusive way: just type ssh hostname, and you connect to it as yourself. The only thing missing is the cluster itself, but we’ll take care of that in the next sections. The next section focuses on the GCE principles of disk throughput allocation, and is more theoretical. The last section will be dedicated to the practical deployment and smoke checks of the new cluster.
