Additional 3.5PB S3/Ceph storage available for use

We are happy to announce that the final installment of the requested additional 3.5PB of usable Ceph/S3 storage is now fully installed and available for use by our Human Genetics groups.

The turnaround from delivery to handover has been under one month for every storage set, and despite some last-mile hardware issues, the service is running well.

More Ceph capacity for the Flexible Compute Environment

Following the success stories from our @scale customers, we have been asked to provide an additional 1PB of usable capacity for our internal flexible compute environment.

The order has been placed and we expect BIOS-IT to have the hardware on site and acceptance tested by the end of September 2017.

We continue to be impressed by the resilience of both the Ceph and S3 services that the current platform has provided since January 2017, and we look forward to seeing the infrastructure's performance continue to scale as additional units are added.

New hardware incoming: Lustre 112/113 upgrade time

After many years of robust service, our Lustre scratch112 and 113 systems are approaching end of life. We are delighted to say that following significant testing and a procurement bakeoff, we have selected Seagate-Cray as the vendor of choice for our replacement system.

The system will be based around Seagate's Nitro SSD cache configuration, chosen as the best fit for our mixed workloads. We look forward to receiving the new hardware by the end of September 2017.

New RedHat test hosts available

Two new RedHat hosts have been installed in farm3 and are available for testing through the retest queue.

We are looking for feedback on this updated operating system, as we propose to move all of our clusters to RedHat later this financial year.

Please check now with your development teams that your software is ready and that your software stacks continue to run as intended.
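When testing on the new hosts, it can help to record which operating system each test job actually ran on, so results from old and new hosts are not mixed up. The sketch below is illustrative only: it assumes the RedHat hosts provide the standard /etc/os-release file (present on RHEL 7, absent on older releases), and the function names are hypothetical helpers, not part of any site tooling.

```python
"""Record which operating system a job actually ran on.

Sketch only: assumes /etc/os-release exists (RHEL 7 style); older systems
without it are reported as 'unknown'.
"""
import shlex
from pathlib import Path


def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in os-release format into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be shell-quoted, e.g. NAME="Red Hat Enterprise Linux Server".
        try:
            parts = shlex.split(value)
        except ValueError:  # unbalanced quotes: fall back to a plain strip
            parts = [value.strip("\"'")]
        info[key] = parts[0] if parts else ""
    return info


def current_os(path: str = "/etc/os-release") -> str:
    """Return a human-readable OS name, or 'unknown' if the file is missing."""
    try:
        info = parse_os_release(Path(path).read_text())
    except FileNotFoundError:
        return "unknown"
    return info.get("PRETTY_NAME") or info.get("NAME", "unknown")


if __name__ == "__main__":
    # Print this at the start of a test job so the log shows the host OS.
    print(current_os())
```

Printing this at the start of each job in the retest queue gives a one-line record to compare against runs on the existing hosts.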

New Teramem systems now live!

Two new teramem systems are now available through the teramem queue on farm3. These hosts are quad-socket units with 20 cores per socket (80 cores in total), and each provides 3TB of memory. This is a significant boost to our existing hugemem environment, where we continue to provide 256GB, 512GB and 1.5TB systems.
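As a rough guide when sizing multi-threaded jobs, the memory available per core on one of these hosts works out as below. This assumes the 3TB is counted in binary units (1TB = 1024GB); counted in decimal units the figure is 37.5GB per core.

```latex
\[
\frac{3\ \text{TB}}{4\ \text{sockets} \times 20\ \text{cores per socket}}
  = \frac{3072\ \text{GB}}{80\ \text{cores}}
  = 38.4\ \text{GB per core}
\]
```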

In addition to their high core counts and large memory, these hosts also have approximately 2TB of NVMe storage mounted under /local/scratch01. This is a very fast local storage area with high IOPS that is ideal for creating graph indexes or for workloads with many small file transactions.
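A common pattern for this kind of local scratch is to do the small-file-heavy work in a private directory on the NVMe area, copy only the final result back to shared storage, and let the directory be deleted afterwards. The sketch below assumes /local/scratch01 is writable by jobs (falling back to the system temporary directory otherwise); `run_in_local_scratch` and `toy_index` are hypothetical names used for illustration, with `toy_index` standing in for a real indexing step.

```python
"""Use fast local NVMe scratch for small-file-heavy work, then clean up.

Sketch only: assumes /local/scratch01 exists and is writable by jobs;
otherwise the system temp directory is used instead.
"""
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional

LOCAL_SCRATCH = "/local/scratch01"  # NVMe area on the teramem hosts


def scratch_base(preferred: str = LOCAL_SCRATCH) -> str:
    """Return the preferred scratch directory if usable, else the default temp dir."""
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return tempfile.gettempdir()


def run_in_local_scratch(work: Callable[[Path], str], results_dir,
                         base: Optional[str] = None) -> Path:
    """Run work(tmpdir) in a private scratch directory and keep only its final file.

    `work` writes into the directory it is given and returns the name of the
    file to keep. That file is copied to results_dir; every intermediate file
    is removed when the temporary directory is cleaned up.
    """
    base = base or scratch_base()
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=base, prefix="job-") as tmp:
        final_name = work(Path(tmp))
        dest = results_dir / final_name
        shutil.copy2(Path(tmp) / final_name, dest)
    return dest


def toy_index(tmp: Path) -> str:
    """Hypothetical stand-in for an indexing step that creates many small files."""
    for i in range(1000):
        (tmp / f"chunk_{i:04d}.txt").write_text(f"{i}\n")
    total = sum(int(p.read_text()) for p in tmp.glob("chunk_*.txt"))
    (tmp / "index.txt").write_text(f"{total}\n")
    return "index.txt"
```

For example, `run_in_local_scratch(toy_index, "results")` creates the thousand chunk files on local NVMe and leaves only `results/index.txt` behind. Keeping the intermediate files off shared storage is the point of the design: the small-file traffic stays on the local device.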

To access these hosts, submit jobs to the teramem queue (-q teramem). Only jobs requesting more than 750GB of memory will currently be accepted into this queue. The maximum job length is currently 15 days, and the new kernel required to support these systems does not yet support BLCR checkpointing, so please be aware of this restriction.

As always, if you have any questions or comments, please let us know in the usual fashion.
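The queue rules above can be checked before a job is submitted. The sketch below builds, but does not run, an LSF `bsub` command for the teramem queue and rejects requests that fall outside the stated limits. It assumes that LSF memory limits on farm3 are expressed in MB and that 1GB is treated as 1000MB; both depend on local LSF configuration and should be checked. `teramem_bsub` is a hypothetical helper, not a site-provided tool.

```python
"""Build an LSF submission for the teramem queue and check it against the stated limits.

Sketch only: assumes -M and rusage[mem=...] are in MB on this cluster
(this depends on the LSF configuration). The command is built, not executed.
"""
import shlex

TERAMEM_MIN_MEM_GB = 750        # jobs must request more than this
TERAMEM_MAX_RUNTIME_DAYS = 15   # current maximum job length
MB_PER_GB = 1000                # assumption; use 1024 if your site counts binary units


def teramem_bsub(command: list, mem_gb: int, runtime_hours: int, cores: int = 1) -> list:
    """Return a bsub argument list for the teramem queue, or raise ValueError."""
    if mem_gb <= TERAMEM_MIN_MEM_GB:
        raise ValueError(f"teramem only accepts jobs requesting > {TERAMEM_MIN_MEM_GB}GB")
    if runtime_hours > TERAMEM_MAX_RUNTIME_DAYS * 24:
        raise ValueError(f"runtime exceeds the {TERAMEM_MAX_RUNTIME_DAYS}-day limit")
    mem_mb = mem_gb * MB_PER_GB
    return [
        "bsub",
        "-q", "teramem",
        "-n", str(cores),
        "-M", str(mem_mb),
        # Reserve the memory and keep all cores on a single host.
        "-R", f"select[mem>{mem_mb}] rusage[mem={mem_mb}] span[hosts=1]",
        "-W", f"{runtime_hours}:00",  # run limit in hours:minutes
        *command,
    ]


if __name__ == "__main__":
    # Hypothetical 1.2TB, 16-thread, 3-day job.
    args = teramem_bsub(["my_assembler", "--threads", "16"],
                        mem_gb=1200, runtime_hours=72, cores=16)
    print(shlex.join(args))
```

Because BLCR checkpointing is not available on these hosts, a job that hits the 15-day limit cannot be resumed, so it is worth splitting very long work into stages where possible.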

Early Skylake server testing underway

Intel have kindly donated a full reference evaluation system so that we can see what improvements, if any, our bioinformatics pipelines may realise on this new hardware platform. The model is:

Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz

Intel’s ARK details are available here:

http://ark.intel.com/products/codename/37572/Skylake

and, of course, the Wikipedia article is here:

https://en.wikipedia.org/wiki/Skylake_(microarchitecture)

We would also like to gauge interest in Intel's combined Skylake and FPGA offering. If anyone is interested, we can arrange access to a test system.

AMD systems are also looking very interesting:

https://www.amd.com/en/events/naples-tech-day

We are looking to start evaluating their hardware very soon. Finally, a decent bakeoff opportunity!

FCE 300% network performance improvement!

After working with Mellanox, we have achieved a 300% network performance improvement across our flexible compute environment! This has a huge impact on the service and dramatically speeds up VM-to-VM communication.

A kernel upgrade was required for most of the uplift. It appears that the drivers and firmware provided in the default RedHat kernel are sub-optimal.

HashiCorp course notes

Having attended HashiDays in London last week, we now have the slides from the Terraform and Consul courses. They are available to Sanger Google account holders only, here:

http://preview.tinyurl.com/y7rqqakb

RENCI iRODS presentation now available on our presentations page

First-hand experiences, from a systems perspective, of the recent iRODS upgrade from 3.3.1 to 4.1.10. This presentation was delivered remotely to the RENCI iRODS user group meeting last week.

Presentations