Our Cloud HPC wins HPCWire 2018 Best Use of HPC in the Cloud

Wow, it's been less than two years since Sanger started its live private cloud service. Over this time we've had an incredible journey with both the OpenStack community at large and our local developers, research staff and vendors.

We have just heard from Supercomputing 2018 (SC18) that we have been awarded the Readers' Choice award for Best Use of HPC in the Cloud. This is a considerable achievement for the IT, informatics and project teams that have been part of this journey.

Many thanks to all.

https://www.hpcwire.com/off-the-wire/hpcwire-reveals-winners-of-the-2018-readers-and-editors-choice-awards-at-sc18-conference-in-dallas-tx/

More updates to follow shortly!

Let’s Encrypt the RADOS Gateway

Introduction

The RADOS Gateway (rgw for short) is a component of Ceph that provides S3-compatible storage. Our users use it to make data publicly available, as well as to share data privately with collaborators.

Naturally, we want to use HTTPS for this, which means we need a TLS certificate. This needs to be a wildcard certificate, as S3 typically puts the bucket name into the request hostname (e.g. for our service cog.sanger.ac.uk, bucket foo would be referred to as foo.cog.sanger.ac.uk).

We use a commercial wildcard certificate for our production S3 gateway, but these have a number of downsides (renewal/rollover is typically a manual process, and they cost money – around £75/year from the Jisc provider, up to £180/year from a commercial provider), so we decided to look at Let's Encrypt (LE) for our test S3 gateways, with a view to using LE in production when our current certificate expires.

This article briefly outlines how we set this up, in the hope it might be of interest to others.

Let's Encrypt

Let's Encrypt (LE) is a free, automated certificate authority, based on Free Software. It is a non-profit, funded by donations. Its certificates are short-lived (90 days), so you essentially have to automate their acquisition and deployment.

The protocols used by LE are well-documented elsewhere, but essentially it's a challenge-response system – you have to prove to LE that you own the domain(s) you're requesting a certificate for. There are two ways to do this – over HTTP (i.e. putting the challenge token on a web server) or DNS (putting the challenge token into a special DNS record).
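
With the DNS-based challenge, the token is published in a TXT record under the _acme-challenge label for the domain being validated. Purely as an illustration (the token value below is made up), you can check that the record is visible before LE goes looking for it:

dig +short TXT _acme-challenge.cog.sanger.ac.uk
"rBd41ZJ-made-up-challenge-token-Xq7w0cE"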

Dehydrated & DNS updates

There are a number of LE clients available; we chose dehydrated, because it's a fairly simple shell script, doesn't need (or expect) root access, and doesn't try to do any of its own crypto. Also, the setup is pretty straightforward.

For S3, we need a wildcard certificate, and those are only available using the DNS-based challenge. Dehydrated supports this; you need to supply a hook to let it update the relevant DNS records. The Dehydrated wiki has hooks for a number of providers and resolvers, but not one for Infoblox, the BIND-based DNS/DHCP/IPAM platform we use. So we had to write one. Essentially, this involves POSTing small JSON fragments to the Infoblox API:

{
 "method": "POST",
 "object": "record:txt",
 "data": {
   "name": "_acme-challenge.${domname}",
   "text": "$token",
   "view": "external"
 }
}
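
As an illustration of what the hook does with that fragment, a minimal sketch follows; the grid master hostname, WAPI version and credentials file are placeholders rather than our real values, and the real script has rather more error handling:

# Minimal sketch: create the challenge TXT record via the Infoblox
# WAPI "request" object. Hostname, WAPI version and credentials file
# are placeholders.
payload=$(cat <<EOF
{
  "method": "POST",
  "object": "record:txt",
  "data": {
    "name": "_acme-challenge.${domname}",
    "text": "$token",
    "view": "external"
  }
}
EOF
)
curl -s -u "$(cat /etc/dehydrated/infoblox.creds)" \
     -H 'Content-Type: application/json' \
     -d "$payload" \
     "https://infoblox.example.com/wapi/v2.7/request"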

Beyond that, dehydrated needs relatively little configuration – a domains.txt containing the domains we want certificates for, and a configuration file, most of which can be left at the defaults. We change a couple of things to specify the LE API endpoint (there's a staging endpoint that can be used for testing, to avoid various rate limits), that we want to use the DNS challenge type, and where the hook script is:

CA="https://acme-v02.api.letsencrypt.org/directory" #Production v2 endpoint
CHALLENGETYPE="dns-01"
HOOK=/usr/local/lib/dehydrated/infobloxhook.sh
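
For completeness, the domains.txt side is a one-liner. dehydrated expects wildcard entries to carry an alias (the name after the >), which it uses for the certificate directory instead of the literal *; the alias shown here is just illustrative:

*.cog.sanger.ac.uk > star.cog.sanger.ac.uk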

Key distribution

At this point we can get a wildcard certificate for *.cog.sanger.ac.uk, but we still need a way to deploy it across our rgws. Since each rgw is a Ceph client (and so has a Ceph credential), the easiest way to do this is using a Ceph pool. So we make a small Ceph pool, and then adjust the hook script to store the key, certificate, and certificate chain in the pool:

# "deploy_cert" hook: store the new private key, certificate and chain
# as objects in the rgwtls pool, authenticating as this host's rgw client
"deploy_cert")
  rados -n "client.rgw.$hn" -p rgwtls put privkey.pem "$priv"
  rados -n "client.rgw.$hn" -p rgwtls put cert.pem "$cert"
  rados -n "client.rgw.$hn" -p rgwtls put chain.pem "$chain"
;;
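
Creating the pool itself is a one-off step done outside the hook; a minimal sketch (the PG count is arbitrary, and the rgw client keys also need access to the pool) might be:

# One-off: a small pool to hold the key/certificate/chain objects
ceph osd pool create rgwtls 8
# After a dehydrated run, the three objects should be visible
rados -p rgwtls ls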

Each rgw then runs a cron job that gets the key, chain, and certificate out of the rgwtls pool, performs some sanity checks, and then copies them into place if they’re different to the currently-deployed set. The web server within the rgw is civetweb, which expects a single .pem file containing key, certificate, and intermediate chain.

# Concatenate into the single .pem that civetweb expects, then replace
# the deployed copy (and restart the rgw) only if it has changed
cat privkey.pem cert.pem chain.pem >rgwtls.pem
if ! cmp -s rgwtls.pem /etc/ceph/rgwtls.pem ; then
    mv rgwtls.pem /etc/ceph/rgwtls.pem
    systemctl restart ceph-radosgw@rgw.$(hostname -s)
fi
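
The retrieval half of that cron job is just the reverse of the deploy_cert hook; a rough sketch (sanity checks and error handling omitted) looks like:

# Fetch the current objects from the rgwtls pool as this host's rgw client
hn=$(hostname -s)
for f in privkey.pem cert.pem chain.pem ; do
    rados -n "client.rgw.$hn" -p rgwtls get "$f" "$f"
done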

In outline, that’s all there is to it! One script to negotiate the DNS challenge with LE and copy the key/certificate/chain into a Ceph pool, and another script to deploy them.

Automated deployment

We use Ansible to manage our Ceph deployment; specifically, we manage a local branch off the 3.0 Stable version of ceph-ansible. So we developed a small role to automate deploying this on our Ceph clusters. It's mostly just the obvious copying of scripts into place, installing cron jobs, and so on. We do, however, only want one of our rgws to actually be running dehydrated (otherwise multiple dehydrated clients would be getting different keys and certificates for the same domain, which would lead to confusion!); we achieve this by arranging for it to be installed and run only on the first member of the rgws group, e.g.:

- name: Install wrapper script round dehydrated
  copy:
    src: cron_wrapper.sh
    dest: /usr/local/lib/dehydrated/
    mode: 0755
  delegate_to: "{{ groups.rgws[0] }}"
  run_once: true
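
The cron job that drives dehydrated can be installed the same way; a hedged sketch using Ansible's cron module (the schedule and job name are our assumptions here, not taken verbatim from the role):

- name: Run dehydrated via the wrapper script daily
  cron:
    name: dehydrated certificate renewal
    special_time: daily
    job: /usr/local/lib/dehydrated/cron_wrapper.sh
  delegate_to: "{{ groups.rgws[0] }}"
  run_once: true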

Issues

We found a couple of wrinkles that are worth noting. Firstly, the LE servers do not share state between them, which means you need a stable external IP address for the DNS challenge to work. If you get strange errors from dehydrated, it's worth checking this; our scripts ensure it by using a single IP address of our local web proxy as their HTTP proxy.
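
dehydrated talks to the ACME API with curl, which honours the usual proxy environment variables, so pinning the outbound address can be as simple as exporting a proxy in the wrapper script (the proxy URL below is a placeholder):

# Send all ACME traffic via the local web proxy so LE always sees the
# same source address (placeholder proxy URL)
export HTTPS_PROXY="http://webproxy.example.com:3128"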

Secondly, the civetweb used in Ceph versions prior to Luminous has no mechanism for noticing when its TLS key/cert change, so you have to restart the rgw, which is disruptive. Given the short validity period of LE certificates, this equates to a restart every 90 days. S3 clients should retry, so this is unlikely to cause a problem, but it’s something to be aware of. With Luminous or later, you can set ssl_short_trust in your civetweb configuration, and then the restart isn’t needed.
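
By way of illustration only (not our exact configuration), the civetweb frontend options on a Luminous-or-later rgw might look something like the following in ceph.conf; the section name is a placeholder, and the certificate path matches the deployment script above:

[client.rgw.myrgwhost]
rgw frontends = civetweb port=443s ssl_certificate=/etc/ceph/rgwtls.pem ssl_short_trust=yes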

Show Me The Code!

We’ve glossed over a lot of details of error-handling and suchlike, in the interests of brevity. If you actually want to try this yourself, then you’ll want to care about the tedious details 🙂

Our Ansible role is available on GitHub; it's not exactly a drop-in role – you'll need the hook script to talk to your DNS machinery, for example, and to store the related credential using regpg or similar. But you can see all the error-checking details that I glossed over above, and it's hopefully a useful starting point.

We’ve offered the role to the upstream ceph-ansible project, so it may yet appear there in future…

More and more PB

Our iRODS infrastructure continues to grow. We have just signed off another >1PB of iRODS capacity for our CASM group. This brings our iRODS archives up to approximately 14PB of total capacity across the Sanger and JSDC data centers.

Another couple of PB added…

At the end of 2017, we delivered another 1PB of Ceph storage to our rapidly expanding internal cloud, the flexible compute environment. Over the course of 2017-2018, we have gone from 0PB to 5.5PB of usable Ceph-based storage.

This has become our most rapidly deployed storage platform to date, and its stability in particular has remained outstanding. More details are available here:

Seagate/Cray Lustre system now up and running

At the end of 2017, our first Seagate-Cray Lustre system passed acceptance testing and our scientific groups are now busy migrating their data and workflows.

This is an L300N system, which includes a transparent NXD flash layer to help smooth out heavy random and small data accesses; more details are available here:

https://www.cray.com/products/storage/clusterstor

The older systems are due for retirement on the 4th of March, when they will become available for further development and testing.

First Cray/Seagate Lustre procured by Sanger!

Sanger has become Cray/Seagate's first Lustre customer! The system has arrived and is undergoing acceptance testing.

We intend to complete testing and introduce the system by the end of November 2017. This is a very aggressive timeline, with a view to retiring the older filesystem areas by March 2018.

New hardware now on site

Our 2017 financial year end has brought with it a number of procurements. Over the next few months we will be installing:

2PB of Cray/Seagate Lustre, 2PB of iRODS, 1PB of Ceph, a new OpenStack zone, an updated SciaaS platform, iRODS performance server units, SQL servers and a full OpenStack test environment.

We are also preparing for our new DC quadrant, which is expected to come online in 2018.

Should be fun! We will keep our internal customers updated through the IT project portal in the usual way.