Storage and compute cluster
The storage and compute cluster is intended as a replacement for CSE's VMware and SAN infrastructure. Its primary functions are:
- A resilient, redundant storage cluster consisting of multiple cheap[-ish] rack-mounted storage nodes running Linux and Ceph using multiple local SSD drives, and
- Multiple compute nodes running Linux and QEMU/KVM acting as both secondary storage nodes and virtual machine hosts. These compute nodes are similar in CPU count and RAM to CSE's discrete login/VLAB servers.
Important implementation considerations
- Data in the storage cluster is replicated in real time across multiple hosts, so the failure of one or more nodes will not cause loss of data, and when any such failure occurs the cluster software (Ceph) automatically rebuilds the lost replicas on the remaining nodes (a short sketch of the relevant replication settings follows this list).
- Compute nodes are all identically configured in terms of networking, CPU count and RAM, so should any compute node fail, the virtual machines running on it can be migrated to, or restarted on, another compute node without loss of functionality.
- While initially co-located in the same data centre (K17), the intention is that storage and compute nodes can be distributed across multiple data centres (especially having two located at the UNSW Kensington campus) so that a data centre failure, rather than just the failure of a single host, does not preclude restoring full operation of all hosted services.
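As a minimal sketch (the pool and rule names here are only examples, not this cluster's actual ones), the replica count and the failure domain across which Ceph spreads those replicas can be inspected and adjusted like this:
# Inspect and set the number of replicas kept for a pool (pool name "rbd" is an example)
ceph osd pool get rbd size
ceph osd pool set rbd size 3
# Create a CRUSH rule that places each replica on a different host;
# once nodes span data centres, "host" could become "datacenter" instead
ceph osd crush rule create-replicated replicated-by-host default host
ceph osd pool set rbd crush_rule replicated-by-host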
Concept
The diagram at the right shows the basic concept of the cluster.
- The primary storage and compute nodes have 10Gb network interfaces used to access and maintain the data store.
- There are two network switches providing redundancy for the storage node network traffic and ensuring that at least one compute node will have access to the data store in case of switch failure.
- Management of the cluster happens via a subnetwork separate from the data store's own traffic network.
Additional storage and compute nodes can be added; there is no practical limit on their number.
Of course, if/when the cluster is decentralised, the networking will have to be revisited according to the networking available at the additional data centres.
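For reference, Ceph itself also distinguishes a public network (client and mon traffic) from an optional cluster network (OSD replication traffic). If that split is wanted it is declared under [global] in /etc/ceph/ceph.conf in the same style as the example configuration further down; the subnets below are placeholders only, not the cluster's real addressing:
public network = 129.94.242.0/24
cluster network = 10.0.0.0/24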
What Ceph provides
Fundamentally, Ceph provides a redundant, distributed data store on top of physical servers running Linux. Ceph itself manages replication, patrol reads/scrubbing and maintaining redundancy, ensuring that data remains automatically and transparently available in the face of disk, host and/or site failures.
The data is presented for use primarily, but not exclusively, as network-accessible raw block devices (RADOS Block Devices or RBDs) for use as filesystem volumes or boot devices (the latter especially by virtual machines); and as mounted network file systems (CephFS) similar to NFS. In both cases the data backing the device or file system is replicated across the cluster.
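A minimal sketch of both access methods, assuming a pool named vmpool and an image named disk0 (names chosen purely for illustration), a CephFS file system that has already been created, and a client host holding the Ceph packages, /etc/ceph/ceph.conf and an authorised keyring:
# Block device: create an image in a replicated pool, map it and use it like any disk
ceph osd pool create vmpool
rbd pool init vmpool
rbd create vmpool/disk0 --size 10240        # size in MB
rbd map vmpool/disk0                        # appears as /dev/rbdX (and /dev/rbd/vmpool/disk0)
mkfs.ext4 /dev/rbd/vmpool/disk0
mount /dev/rbd/vmpool/disk0 /mnt/disk0
# File system: mount CephFS much like an NFS share
mount -t ceph 129.94.242.95:/ /mnt/cephfs -o name=admin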
Each storage and compute node typically runs one or more Ceph daemons while other servers accessing the data stored in the cluster will typically only have the Ceph libraries and utilities installed.
The Ceph daemons are:
- mon - monitor daemon - multiple instances across multiple nodes maintain status and raw data replica location information. Ceph uses a quorum of mons to avoid split-brain issues, so the number of mons should be an odd number greater than or equal to three (the sketch after this list includes a quick quorum check).
- mgr - manager daemon - at least one is required to assist the mons, so it is useful to have two in case one dies.
- mds - a metadata server for CephFS (file permission bits, ownership, time stamps, etc.). The more the merrier and, obviously, two or more is a good thing.
- osd - the object store daemon. One per physical disk, so there can be more than one on a physical host. Handles maintaining, replicating, validating and serving the data on the disk. OSDs talk to each other a lot.
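Each daemon class can be checked from any node holding the admin keyring; these are standard Ceph status commands:
ceph mon stat          # monitors and whether they are in quorum
ceph quorum_status     # detailed quorum view
ceph mgr stat          # which mgr is active and which are standby
ceph mds stat          # MDS state (only meaningful once CephFS is in use)
ceph osd tree          # OSDs grouped by host, with up/down and in/out status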
What QEMU/KVM provides
- Machine virtualisation
- Limited virtual networking, which is supplemented by the local host's own networking/bridging/firewalling (an example of a guest booting from an RBD image is sketched below)
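As a rough illustration of how the two pieces fit together (the pool and image names are placeholders, real guests are started with many more options, and QEMU's rbd block driver must be available, typically via the qemu-block-extra package on Debian), QEMU can create and boot from an RBD image directly:
# Create a raw image stored in the Ceph pool rather than on a local disk
qemu-img create -f raw rbd:vmpool/vm0-disk 20G
# Boot a KVM guest whose "disk" is that RBD image
qemu-system-x86_64 -enable-kvm -m 4096 \
    -drive format=raw,file=rbd:vmpool/vm0-disk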
Installing and configuring the physical servers to run Linux and Ceph
Servers are Dell.
- On the RAID controller:
- Configure two disks for the boot device as RAID1 (mirrored)
- Configure the remaining data store disks as RAID0 with one single component disk each
- Use eth0 to do a network install of Debian Bullseye selecting:
- Static network addressing
- SSH Server only
- Everything installed in the one partition
- Configure timezone, keyboard, etc.
- Reboot into installed OS
- Fix/enable root login
- Change /etc/ssh/sshd_config to include:
UsePAM no
PasswordAuthentication no
- Install packages:
apt-get install wget gnupg man-db lsof strace tcpdump iptables-persistent rsync software-properties-common
- Ensure unattended-upgrades is installed and enabled (dpkg-reconfigure unattended-upgrades); a quick verification sketch follows this list.
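Before moving on to Ceph, it can be worth confirming the base configuration took effect; a minimal check sketch (nothing here is specific to this cluster):
lsblk -o NAME,SIZE,TYPE,MODEL                       # boot mirror plus one device per data disk
sshd -T | grep -Ei 'usepam|passwordauthentication'  # effective sshd settings
systemctl status unattended-upgrades                # unattended upgrades enabled and running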
Install the Ceph software packages:
Refer to the manual installation procedures at Installation (Manual).
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt-add-repository 'deb https://download.ceph.com/debian-quincy/ bullseye main'
apt-get install ceph
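A quick, generic check that the intended release was installed:
ceph --version           # should report a Quincy (17.x) release
apt-cache policy ceph    # confirms the package came from download.ceph.com rather than Debian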
When first creating a Ceph cluster you need to do the following ONCE. Once the cluster is running no further bootstrapping is required. See further down for how to ADD a mon to an already-running cluster.
- Perform Monitor Bootstrapping (Monitor bootstrapping); a sketch of these commands follows the example ceph.conf below.
- You may need to run chown -R ceph:ceph /var/lib/ceph/mon/ceph-<host> to get the mon started the first time, then systemctl enable ceph-mon@<host>.
The initial /etc/ceph/ceph.conf file will look a bit like this:
[global]
fsid = db5b6a5a-1080-46d2-974a-80fe8274c8ba
mon initial members = storage00
mon host = 129.94.242.95
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[mon.storage00]
host = storage00
mon addr = 129.94.242.95
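A sketch of the bootstrap commands themselves, following the upstream manual-deployment procedure and reusing the hostname, address and fsid from the example above (adjust to suit; this is not a substitute for the linked documentation):
# Create the mon keyring and an admin keyring, and combine them
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
# Build the initial monitor map and initialise the first mon's data directory
monmaptool --create --add storage00 129.94.242.95 --fsid db5b6a5a-1080-46d2-974a-80fe8274c8ba /tmp/monmap
mkdir /var/lib/ceph/mon/ceph-storage00
ceph-mon --mkfs -i storage00 --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
chown -R ceph:ceph /var/lib/ceph/mon/ceph-storage00
systemctl start ceph-mon@storage00
systemctl enable ceph-mon@storage00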
Set up a mgr daemon:
mkdir /var/lib/ceph/mgr/ceph-<host>
ceph auth get-or-create mgr.<host> mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/ceph-<host>/keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-<host>
systemctl start ceph-mgr@<host>
systemctl enable ceph-mgr@<host>
Some run time configuration related to running mixed-version Ceph environments (which we hopefully don't do):
ceph config set mon mon_warn_on_insecure_global_id_reclaim true
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed true
ceph config set mon auth_allow_insecure_global_id_reclaim false
ceph mon enable-msgr2
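The settings can be confirmed afterwards (standard commands, nothing site-specific):
ceph config get mon auth_allow_insecure_global_id_reclaim   # expect: false
ceph mon dump                                               # mon addresses should now include v2: entries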
Check Ceph cluster-of-one-node status:
ceph status
Software and configuration common across all storage and compute nodes
In a disaster situation, the storage and compute nodes need to be able to run without depending on any additional network services. Thus site network configuration, necessary utility scripts and any other important files are duplicated on all nodes. Principally these are:
| File name | Description | Action to take manually when changed |
|---|---|---|
| /etc/hosts | Node network names and IP addresses | |
| /etc/ceph/ceph.conf | Main Ceph configuration file. Used by all Ceph daemons and libraries, and very importantly by the mon daemons | Restart the mon daemons (one at a time, not all at once! Keep an eye on ceph status while doing so) |
| /etc/ceph/ceph.client.admin.keyring | Access key used by tools and utilities to grant administrative access to the cluster | |
| /root/bin/ceph_* | CSG-provided utilities to maintain the cluster | |
| /etc/iptables/rules.v4 | iptables/netfilter rules to protect the cluster | Run /usr/sbin/netfilter-persistent start |
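The actual /root/bin/ceph_distribute_support_files is CSG-provided; purely to illustrate the idea, a hypothetical version might do little more than push the files above to every node (the node names and file list here are placeholders, not the real script's contents):
#!/bin/sh
# Hypothetical sketch only - not the real CSG script.
# Copies the common configuration files to every cluster node.
FILES="/etc/hosts /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring /etc/iptables/rules.v4"
NODES="storage00 storage01 compute00"   # placeholder node names

for node in $NODES; do
    for f in $FILES; do
        rsync -a "$f" "root@$node:$f"
    done
done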
Adding additional storage nodes
- Follow the steps above for installing and configuring Linux up to using apt-get to install the Ceph packages.
- Update /root/bin/ceph_distribute_support_files and /etc/hosts on a running node to include the new node.
- On the same already-running node, run /root/bin/ceph_distribute_support_files.
- To add storage, select an unused OSD number and on the new node run the commands below (a verification sketch follows them):
ceph_create_quincy_osd /dev/<blockdevice> <osdnum>
systemctl start ceph-osd@<osdnum>
systemctl enable ceph-osd@<osdnum>
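Once the OSD starts, the cluster should show it and begin rebalancing data onto it; these are standard checks:
ceph osd tree       # the new OSD should appear under the new host, marked "up" and "in"
ceph osd df         # per-OSD capacity and utilisation
ceph status         # watch the PG states while data backfills onto the new OSD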
Add additional mon daemons
Note: A maximum of one mon and one mgr are allowed per server/node.
# mkdir /var/lib/ceph/mon/ceph-<hostname>
# ceph auth get mon. -o /tmp/keyfile
# ceph mon getmap -o /tmp/monmap
# ceph-mon -i <hostname> --mkfs --monmap /tmp/monmap --keyring /tmp/keyfile
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-<hostname>
# systemctl start ceph-mon@<hostname>
# systemctl enable ceph-mon@<hostname>
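After the new mon joins, confirm it is in the quorum and that the total number of mons is still odd:
ceph mon stat                               # lists the mons and the current quorum
ceph quorum_status --format json-pretty     # detailed view, including which mon is the leader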