Storage and compute cluster

From techdocs
Revision as of 10:26, 24 October 2022 by Plinich

The storage and compute cluster is intended as a replacement for CSE's VMware and SAN infrastructure. Its primary functions are:

  1. A resilient, redundant storage cluster consisting of cheap[-ish] rack-mounted storage nodes running Linux and Ceph using local SSD drives, and
  2. Multiple compute nodes running Linux and QEMU/KVM acting as both secondary storage nodes and virtual machine hosts. These compute nodes are similar in CPU count and RAM to CSE's discrete login/VLAB servers.

Important implementation considerations:

  1. Data in the storage cluster is replicated in real time across multiple hosts, so the failure of one or more nodes will not cause loss of data. When such a failure occurs, the cluster software (Ceph) automatically rebuilds the lost replicas on the remaining nodes.
  2. Compute nodes are all configured alike in terms of networking, CPU count and RAM, so should any compute node fail, the virtual machines running on it can be migrated to or restarted on another compute node without loss of functionality.
  3. While the nodes are initially co-located in the same data centre (K17), the intention is that storage and compute nodes can be distributed across multiple data centres (in particular, two located at the UNSW Kensington campus), so that a data centre failure, rather than just the failure of a single host, does not preclude restoring full operation of all hosted services.
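
The replicate-and-rebuild behaviour in the first consideration can be illustrated with a toy model. This is only a sketch and not how Ceph itself places data (Ceph uses its own CRUSH placement algorithm). The replication factor of 3, the `Cluster` class, the least-loaded placement rule and the node names are all assumptions made for illustration; none of them come from the cluster's actual configuration.

```python
from __future__ import annotations

from dataclasses import dataclass, field

REPLICAS = 3  # assumed replication factor; the page does not state one


@dataclass
class Cluster:
    """Toy model of replicated storage (hypothetical, not Ceph's algorithm)."""

    nodes: set[str]
    # Maps each object name to the set of nodes holding a replica of it.
    placement: dict[str, set[str]] = field(default_factory=dict)

    def _least_loaded(self, exclude: set[str]) -> list[str]:
        """Return live nodes not in `exclude`, fewest replicas first."""
        load = {n: 0 for n in self.nodes if n not in exclude}
        for holders in self.placement.values():
            for n in holders:
                if n in load:
                    load[n] += 1
        return sorted(load, key=lambda n: (load[n], n))

    def store(self, obj: str) -> None:
        """Place up to REPLICAS copies of `obj` on distinct nodes."""
        self.placement[obj] = set(self._least_loaded(set())[:REPLICAS])

    def fail(self, node: str) -> None:
        """Remove a failed node and rebuild its replicas on survivors."""
        self.nodes.discard(node)
        for holders in self.placement.values():
            if node in holders:
                holders.discard(node)
                # Pick a surviving node that does not already hold a copy.
                candidates = self._least_loaded(holders)
                if candidates:
                    holders.add(candidates[0])


if __name__ == "__main__":
    cluster = Cluster(nodes={"s1", "s2", "s3", "s4"})
    for i in range(5):
        cluster.store(f"obj{i}")
    cluster.fail("s2")
    for obj, holders in sorted(cluster.placement.items()):
        print(obj, sorted(holders))
```

In this model, an object keeps its full replica count after a node failure only while enough surviving nodes remain to hold distinct copies, which is one reason to spread nodes across more than one data centre.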