New hardware incoming: Lustre 112/113 upgrade time

After many years of robust service, our Lustre scratch112 and scratch113 systems are approaching end of life. We are delighted to say that, following significant testing and a procurement bake-off, we have selected Seagate-Cray as the vendor of choice for our replacement system.

The system will be based around Seagate's Nitro SSD cache configuration to give the best fit for our mixed workloads. We look forward to receiving the new hardware by the end of September 2017.

New RedHat test hosts available

Two new RedHat hosts have been installed in farm3 and are available for testing through the retest queue.

We are looking for feedback on this updated operating system, as we are proposing to move our clusters to RedHat throughout later this financial year.

So please check now with your dev teams to ensure that your software is ready and that your software stacks continue to run as intended.
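As a quick sketch, submitting a test job to the retest queue might look like the following (assuming the farm's standard LSF bsub syntax; the output file and script name are purely illustrative):

bsub -q retest -o retest.%J.out ./run_my_pipeline.sh

Here %J expands to the LSF job ID, so each test run writes to its own log file.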

New Teramem systems now live!

Two new teramem systems are now available through the teramem queue on farm3. These new hosts are quad-socket, 20-core units (80 cores in all) and provide 3TB of memory each. This is a significant boost to our existing hugemem environment, where we continue to provide 256GB, 512GB and 1.5TB systems.

In addition to the high core counts and memory that these new hosts provide, they also have approximately 2TB of NVMe storage mounted under /local/scratch01. This is a very fast, high-IOP/s local storage area that is ideal for creating graph indexes or general small-file transactions.

To access these hosts, jobs will need to be submitted to the teramem queue (-q teramem), and only jobs requesting more than 750GB of memory will currently be accepted into this queue. The maximum job length is currently set to 15 days, and the new kernel required to support these systems does not yet support BLCR checkpointing, so please be aware that this restriction exists.
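As a minimal sketch, a teramem submission might look like the following (assumptions: -M and the select/rusage figures are interpreted in MB on our farm, and the per-user directory under /local/scratch01 and the script name are purely illustrative):

bsub -q teramem -M 800000 \
  -R "select[mem>800000] rusage[mem=800000]" \
  -o teramem.%J.out \
  "TMPDIR=/local/scratch01/$USER ./build_graph_index.sh"

Pointing TMPDIR at the NVMe area keeps the small-file churn of index building on the fast local storage rather than on Lustre.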

As always, any questions or comments, please let us know in the usual fashion.

Early Skylake server testing underway

Intel have kindly donated a full reference evaluation system so that we can see what improvements, if any, our bioinformatics pipelines may realise on this new hardware platform. The model is the:

Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz

Intel’s ARK details are available here:

http://ark.intel.com/products/codename/37572/Skylake

and, of course, the Wikipedia article is here:

https://en.wikipedia.org/wiki/Skylake_(microarchitecture)

We are also looking to gauge interest in Intel's combined Skylake-plus-FPGA offering; if anyone is interested, we can gain access to a test system.

AMD systems are also looking very interesting:

https://www.amd.com/en/events/naples-tech-day

We are looking to start evaluating their hardware very soon… Finally, a decent takeoff opportunity!

FCE 300% network performance improvement!

After working with Mellanox, we have managed to realise a 300% network performance improvement across our flexible compute environment! This has a huge impact on the service and dramatically improves VM-to-VM communication.

A kernel upgrade was required for most of the uplift; it appears that the drivers and firmware provided with the default RedHat kernel are sub-optimal.

Hashicorp course notes

Having attended HashiDays in London last week, we now have the slides from the Terraform and Consul courses. They are available to Sanger Google account holders only, here:

http://preview.tinyurl.com/y7rqqakb

RENCI iRODS presentation now available on our presentations page

First-hand experiences, from a systems perspective, of the recent upgrade from iRODS 3.3.1 to 4.1.10. This presentation was delivered remotely to the RENCI iRODS user group meeting last week.

Presentations


OpenStack Upgrade Investigations Completed

We have evaluated the RedHat OpenStack Newton release and have confirmed that it is safe to proceed with an upgrade of our flexible compute environment (FCE) to this new release over the summer!

As part of the upgrade we have been able to incorporate CVX network support. This adds direct switch management through OpenStack, rather than the previously inelegant Neutron layer that OpenStack otherwise uses. This is the first big step towards rolling out new systems as bare metal via OpenStack!

Secure Lustre

Having worked with DDN, our team have managed to put together a Lustre system that is multi-tenant capable. A video presentation link is on our presentations page, and the system is currently in alpha testing on our flexible compute platform. Could this be the high-performance POSIX filesystem that will help migrate existing HPC applications to the flexible compute platform?!

Presentations