Too many ssh-agents and the keychain

This is the tale of three really simple and reliable tools, rsync, cron, and keychain, combining to create some confounding emergent complexity.

ssh-agent is a daemon that holds onto your credentials so that you only have to enter your key’s passphrase once per session. After you unlock your private key via ssh-add, ssh-agent holds onto the decrypted key. Whenever another program needs something to be done with the private key, it asks ssh-agent to do it. ssh-agent passes the result of that operation back to the client program instead of letting it have the unencrypted private key.

Some limitations of ssh-agent are:

  • When the user logs out, ssh-agent may shut down, and the user will have to run ssh-add again.
  • Cron jobs and other automated processes don’t have access to the agent and thus need to find another way to provide credentials for their tasks.

On many Linux distros, there is a keychain program that keeps the same ssh-agent running across logins and publishes information about it so that other programs (including those running in cron jobs) can get at it.

Scheduling authentication-requiring tasks with cron

I was trying to get cron to run a bash script that ran rsync to copy a directory on a remote machine to a local directory.

This didn’t work. In the context of the cron job, I kept getting Permission denied (publickey) from the rsync command.

After sshing into the remote machine to trigger a login and get my password into ssh-agent, I ran the script directly in the terminal, and it worked without prompting.

Something I forgot (for a while) is that I did not do this in a login shell. In order to simulate the cron environment, I ran sh, and ran the script in that shell.

It still failed via cron job, though. I thought that ssh-agent would have the unencrypted private key it would need at that point, but clearly something was still wrong.
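One thing worth noting about the sh test: a plain sh started from a terminal still inherits every exported variable from the parent shell, including SSH_AUTH_SOCK, so it is not a close stand-in for cron. A rougher but closer approximation is to wipe the environment first. This is only a sketch; run_like_cron is a made-up helper name, and the PATH and SHELL values are assumptions about a typical cron setup, not something read from a real crontab.

```shell
# Run a command with a stripped-down environment, roughly like cron's.
# PATH and SHELL below are assumed typical values; adjust for your system.
run_like_cron() {
    env -i HOME="${HOME:-/}" LOGNAME="$(id -un)" \
        PATH=/usr/bin:/bin SHELL=/bin/sh \
        /bin/sh -c "$1"
}

# The agent variables from the login session do not survive the wipe.
run_like_cron 'echo "SSH_AUTH_SOCK=${SSH_AUTH_SOCK:-<unset>}"'
```

Running the actual script through a wrapper like this should reproduce the cron failure in the terminal, where it is easier to poke at.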

The environment

In my script, I had a line like this, which I thought would let my script get access to ssh-agent:

~/.keychain/machinename-sh

That file, which is generated by keychain, is a shell script that exports two environment variables:

  • SSH_AGENT_PID
  • SSH_AUTH_SOCK

The socket is what programs need to communicate with ssh-agent; the process id identifies the agent process itself (for example, so that it can be shut down later). When I echoed those variables in my script, they were empty. When the script ran rsync, I got prompted for credentials. (This is a problem because you can’t be around to enter credentials for most automated tasks.)
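A small check at the top of the script makes this kind of failure visible early instead of at the rsync prompt. The following is a minimal sketch; check_agent_env is a hypothetical helper name, not something keychain provides.

```shell
# Report what this process knows about the agent.
# Returns 0 only if SSH_AUTH_SOCK is set and names an existing socket.
check_agent_env() {
    if [ -z "${SSH_AUTH_SOCK:-}" ]; then
        echo "SSH_AUTH_SOCK is empty" >&2
        return 1
    fi
    if [ ! -S "$SSH_AUTH_SOCK" ]; then
        echo "SSH_AUTH_SOCK=$SSH_AUTH_SOCK is not a socket" >&2
        return 1
    fi
    echo "agent socket: $SSH_AUTH_SOCK (pid: ${SSH_AGENT_PID:-unknown})"
}
```

In the script this could be used as check_agent_env || exit 1 before the rsync line. Note that it only proves an agent socket exists, not that the agent behind it holds the right key.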

When I updated the script to use source ~/.keychain/machinename-sh, those variables were filled out and pointed to an existing process and an existing sock file. (I don’t have a good explanation for why source worked, but running the keychain file as a script didn’t work.)
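A likely explanation is that running a file as a script starts a child process, and variables exported there disappear when the child exits, while sourcing runs the same lines inside the current shell. The demonstration below uses a temporary stand-in file and a made-up variable name, DEMO_SOCK, rather than the real keychain file.

```shell
# A stand-in for the keychain-generated file (the real one lives under ~/.keychain/).
envfile=$(mktemp)
echo 'DEMO_SOCK=/tmp/demo.sock; export DEMO_SOCK' > "$envfile"

unset DEMO_SOCK
sh "$envfile"                      # runs in a child process; its exports are lost
echo "after running:  ${DEMO_SOCK:-<empty>}"

. "$envfile"                       # runs in the current shell; the export sticks
echo "after sourcing: ${DEMO_SOCK:-<empty>}"

rm -f "$envfile"
```

The "." command is the POSIX spelling of source. Since cron often runs jobs with /bin/sh rather than bash, "." is the safer choice in a script meant for cron.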

However, it still failed with “Permission denied” when I ran it via cron. I was able to log those two environment variables, and they still referred to things that existed.

After hours of failed attempts to flush out information that could tell me what was going on, I ran ps aux | grep ssh-agent again. I noticed that the PID of the ssh-agent that I used in the successful manual run was NOT the PID in SSH_AGENT_PID.

The ssh-agent in this env var seemed to not have my key.
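This suspicion can be checked directly, because ssh-add -l talks to whichever agent SSH_AUTH_SOCK points at. As far as I know from the ssh-add man page, it exits with 0 when it can list keys, 1 when the agent has none, and 2 when it cannot reach an agent at all. Here is a sketch that turns those codes into messages; describe_agent_status is a hypothetical helper name.

```shell
# Translate the exit status of `ssh-add -l` into a readable message.
describe_agent_status() {
    case "$1" in
        0) echo "agent reachable and holding at least one key" ;;
        1) echo "agent reachable but holding no keys" ;;
        2) echo "could not contact an agent" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Ask the agent named by the current SSH_AUTH_SOCK, if ssh-add is installed.
if command -v ssh-add >/dev/null 2>&1; then
    status=0
    ssh-add -l >/dev/null 2>&1 || status=$?
    describe_agent_status "$status"
fi
```

Logging this from inside the cron job, right after loading the keychain file, would have shown the empty agent without any PID comparison.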

So, I logged out of the host machine in all terminals (I had three), then logged back in, and keychain ran when I logged in. I then ran ssh-add to add my key. I made sure there were no other ssh-agent processes running.

Then, the script ran via the cron job, and the rsync command worked. The PID that the script logged matched that of the single ssh-agent running.

What I think happened was this:

  • The first time I logged in to the host machine, keychain ran, but I did not, directly or indirectly, ssh-add my key.
    • keychain wrote the ssh-agent PID and socket into ~/.keychain/machinename-sh.
  • I then logged in on another terminal, but then started an sh shell, a non-login shell which did not run keychain.
    • There, a new ssh-agent was started.
    • My key was added to that one.
    • The script ran successfully (when run directly, outside of cron), using that ssh-agent.
  • I fixed the script’s importing of the keychain environment variables, but they referred to an ssh-agent (the first one) that did not have the key added to it.
  • When run via cron, the script ran rsync, which asked ssh-agent to log into the remote server, but the key for it was in a different ssh-agent process, not that one. So, it failed with Permission denied (publickey).

So, make sure you track your ssh-agents (try to have only one) and pay attention to which ssh-agent keychain is pointing clients at.
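One way to do that tracking is to compare the PID keychain recorded against every ssh-agent that is actually running. The sketch below assumes pgrep is available (it ships with procps on most Linux distros); stray_agents is a hypothetical helper name.

```shell
# Given the PID keychain recorded, followed by the PIDs of all running agents,
# print any agent PIDs that keychain is NOT pointing at.
stray_agents() {
    recorded=$1
    shift
    for pid in "$@"; do
        [ "$pid" = "$recorded" ] || echo "$pid"
    done
}

# Usage sketch: list this user's agents and flag the strays.
if command -v pgrep >/dev/null 2>&1; then
    # Unquoted on purpose so each PID becomes its own argument.
    stray_agents "${SSH_AGENT_PID:-none}" $(pgrep -u "$(id -u)" -x ssh-agent || true)
fi
```

If this prints anything, there is more than one agent around, and a key added in one terminal may not be in the agent that cron jobs are told to use.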

#linux #unix #auth #keychain #ssh #cron