Following the success stories from our @scale customers, we have been asked to provide an additional 1PB of usable capacity for our internal flexible compute environment.
The order has been placed and we expect BIOS-IT to have the hardware on site and acceptance tested by the end of September 2017.
We continue to be impressed by the resilience of both the Ceph and S3 services that the current platform has provided since January 2017, and we look forward to seeing the performance of the infrastructure continue to scale as additional units are added.
After many years of robust service, our Lustre scratch112 and 113 systems are approaching end of life. We are delighted to say that following significant testing and a procurement bakeoff, we have selected Seagate-Cray as the vendor of choice for our replacement system.
The system will be based around Seagate's Nitro SSD cache configuration to get the best fit for our mixed workloads. We look forward to receiving the new hardware by the end of September 2017.
Two new RedHat hosts have been installed in farm3 and are available for testing through the retest queue.
We are looking for feedback on this updated operating system, as we are proposing to move our clusters to RedHat throughout later this financial year.
So please check now with your dev teams to ensure that your software is ready to go and that your software stacks continue to run as intended.
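One way to check readiness is to resubmit a representative validation job to the new hosts via the retest queue. A minimal sketch, assuming your site's standard LSF `bsub` submission and a hypothetical validation script name:

```shell
# Sketch: send a representative validation job to the new RedHat hosts.
# The script name is an example only; substitute your own test workload.
RETEST_CMD="bsub -q retest -o retest.%J.out ./run_stack_validation.sh"
echo "$RETEST_CMD"   # shown as a dry run; remove the echo to submit
```

Comparing the output against a run on the current OS is usually the quickest way to spot library or toolchain regressions.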
Two new teramem systems are now available through the teramem queue on farm3. These new hosts are quad-socket, 20-core units (80 cores in all) and provide 3TB of memory each. This is a significant boost to our existing hugemem environment, where we continue to provide 256GB, 512GB and 1.5TB systems.
In addition to the high core counts and memory that these new hosts provide, they also have approximately 2TB of NVMe storage mounted under /local/scratch01. This is a very fast local high-IOP/s storage area that is ideal for creating graph indexes or for general small-file transactions.
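To take advantage of the NVMe area, point your job's temporary I/O at it. A minimal sketch, assuming a per-user subdirectory layout under /local/scratch01 (the layout and fallback are assumptions, not site policy):

```shell
# Sketch: stage small-file-heavy work (e.g. graph index builds) on the
# fast local NVMe area. The per-user directory layout is an assumption.
SCRATCH_ROOT="${SCRATCH_ROOT:-/local/scratch01}"           # NVMe mount on the new hosts
SCRATCH="${SCRATCH_ROOT}/${USER:-nobody}/job_${LSB_JOBID:-manual}"
mkdir -p "$SCRATCH" 2>/dev/null || SCRATCH="$(mktemp -d)"  # fall back when off-host
export TMPDIR="$SCRATCH"                                   # many tools honour TMPDIR
echo "temporary I/O directed to: $TMPDIR"
# ... run the I/O-heavy step here ...
rm -rf "$SCRATCH"                                          # always clean up local scratch
```

Note that /local/scratch01 is local to each host, so anything you need to keep must be copied back to shared storage before the job ends.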
To access these hosts, jobs will need to be submitted to the teramem queue (-q teramem), and only jobs requesting more than 750GB of memory will currently be accepted into this queue. The maximum job length is currently set to 15 days, and the new kernel required to support these systems does not support BLCR checkpointing at this time, so please be aware that this restriction exists.
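A submission to the new queue might look like the following sketch; the memory figure and script name are examples only, and the units expected by `-M` and `rusage` depend on your site's LSF configuration, so check before using:

```shell
# Sketch of a teramem submission. Memory units for -M / rusage vary by
# LSF site configuration; the job script name is an example only.
MEM_MB=800000   # request above the queue's current 750GB admission threshold
BSUB_CMD="bsub -q teramem -M ${MEM_MB} -R \"select[mem>${MEM_MB}] rusage[mem=${MEM_MB}]\" ./big_memory_job.sh"
echo "$BSUB_CMD"   # shown as a dry run; remove the echo to submit
```

Jobs requesting less than the 750GB threshold will be rejected from the queue, so size the request from a real measurement of your workload rather than a guess.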
As always, any questions or comments, please let us know in the usual fashion.
Intel have kindly donated a full reference evaluation system so we can see what improvements, if any, our bioinformatics pipelines may realise on this new hardware platform. The model is:
Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz
Intel’s ARK details are available here:
and of course, the Wikipedia page is here:
We are also looking to gauge interest in Intel's combined Skylake and FPGA offering. If anyone is interested, we can gain access to a test system.
AMD systems are also looking very interesting:
We are looking to start evaluating their hardware very soon. Finally, a decent takeoff opportunity!
After working with Mellanox, we have managed to realise a 300% network performance improvement across our flexible compute environment! This has a huge impact on the service and dramatically improves VM-to-VM communication.
A kernel upgrade was required for most of the uplift; it appears that the drivers and firmware provided in the default RedHat kernel are sub-optimal.
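If you want to confirm what your own hosts are running before and after such an upgrade, the kernel and NIC driver/firmware versions are easy to record. A minimal sketch (the interface name is an assumption, and `ethtool` may need elevated privileges, so that call is shown but not executed here):

```shell
# Sketch: record kernel and NIC driver/firmware versions for comparison
# across an upgrade. Interface name eth0 is an assumption.
KERNEL_VERSION="$(uname -r)"
echo "running kernel: $KERNEL_VERSION"
# ethtool -i eth0   # reports driver, version and firmware-version fields
```

Capturing these alongside a simple VM-to-VM throughput test makes it straightforward to attribute any uplift to the driver and firmware change.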
Having attended HashiDays in London last week, we now have the slides from the Terraform and Consul courses. They are available to Sanger Google account holders only, here:
A quick reminder to all Sanger OpenStack staff: we maintain a local reference list for those new to OpenStack, or developing on it, here:
First-hand experiences, from a systems perspective, of the recent iRODS upgrade process from 3.3.1 to 4.1.10. This presentation was delivered remotely to the RENCI iRODS user group meeting last week.
We have evaluated the RedHat OpenStack Newton release and confirmed that it is safe to proceed with an upgrade of our flexible compute environment (FCE) to this new release over the summer!
As part of the upgrade we have been able to incorporate CVX network support. This adds direct switch management through OpenStack, rather than the previously inelegant Neutron layer that OpenStack otherwise uses. This is the first big step towards rolling out new systems as bare metal via OpenStack!