Storage and compute cluster

The storage and compute cluster is intended as a replacement for CSE's VMware and SAN infrastructure. Its primary functions are:

  1. A resilient, redundant storage cluster consisting of multiple cheap[-ish] rack-mounted storage nodes running Linux and Ceph using multiple local SSD drives, and
  2. Multiple compute nodes running Linux and QEMU/KVM acting as both secondary storage nodes and virtual machine hosts. These compute nodes are similar in CPU count and RAM to CSE's discrete login/VLAB servers.

Important implementation considerations

  1. Data in the storage cluster is replicated in real time across multiple hosts, so the failure of one or more nodes will not cause loss of data, and when such a failure occurs the cluster software (Ceph) automatically rebuilds the lost replicas on the remaining nodes.
  2. Compute nodes are all configured alike in terms of networking, CPU count and RAM, so should any compute node fail, the virtual machines running on it can be migrated to or restarted on another compute node without loss of functionality.
  3. While initially co-located in the same data centre (K17), the intention is that storage and compute nodes can be distributed across multiple data centres (in particular, two located at the UNSW Kensington campus) so that a data centre failure, rather than just the failure of a single host, does not preclude restoring full operation of all hosted services.

Concept

Broad concept of the storage and compute cluster

The diagram at the right shows the basic concept of the cluster.

  1. The primary storage and compute nodes have 10Gb network interfaces used to access and maintain the data store.
  2. There are two network switches providing redundancy for the storage node network traffic and ensuring that at least one compute node will have access to the data store in case of switch failure.
  3. Management of the cluster happens via a separate subnetwork to the data store's own traffic network.

Additional storage and compute nodes can be added as needed; there is no practical limit on their number.

Of course, if/when the cluster is decentralised, the networking will have to be revisited according to the networking available at the additional data centres.

What Ceph provides

Fundamentally, Ceph provides a redundant, distributed data store on top of physical servers running Linux. Ceph itself manages replication, patrol reads/scrubbing and the maintenance of redundancy, ensuring data remains automatically and transparently available in the face of disk, host and/or site failures.

The data is presented for use primarily, but not exclusively, as network-accessible raw block devices (RADOS Block Devices or RBD's) for use as filesystem volumes or boot devices (the latter especially by virtual machines); and as mounted network file systems (CephFS) similar to NFS. In both cases the data backing the device or file system is replicated across the cluster.

Each storage and compute node typically runs one or more Ceph daemons while other servers accessing the data stored in the cluster will typically only have the Ceph libraries and utilities installed.

The Ceph daemons are:

  1. mon - monitor daemon - multiple instances across multiple nodes maintaining status and raw data replica location information. Ceph uses a quorum of mon's to avoid split brain issues and thus the number of mon's should be an odd number greater than or equal to three.
  2. mgr - manager daemon - one is required to assist the mon's, so it's useful to have two in case one dies.
  3. mds - a metadata server for CephFS metadata (file permission bits, ownership, time stamps, etc.). The more the merrier; two or more is a good thing so CephFS survives the loss of one.
  4. osd - the object store daemon. One per physical disk, so there can be more than one on a physical host. Handles maintaining, replicating, validating and serving the data on the disk. OSD's talk to each other a lot.
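
To see which of these daemons a particular node is running, and how the cluster sees them, something like the following works (the systemctl pattern matches the ceph-<type>@<id> service names used later on this page):

systemctl list-units 'ceph-*' --type=service --state=running
ceph status
ceph osd tree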

What QEMU/KVM provides

  1. Machine virtualisation
  2. Limited virtual networking, which is supplemented by the local host's own networking/bridging/firewalling

Installing and configuring the physical servers to run Linux and Ceph

Servers are Dell.

  1. On the RAID controller:
    • Configure two disks for the boot device as RAID1 (mirrored)
    • Configure the remaining data store disks as RAID0 with one single component disk each
  2. Use eth0 to do a network install of Debian Bullseye selecting:
    • Static network addressing
    • SSH Server only
    • Everything installed in the one partition
    • Configure timezone, keyboard, etc.
  3. Reboot into installed OS
  4. Fix/enable root login
  5. Change /etc/ssh/sshd_config to include (see the example after this list)
    • UsePAM no
    • PasswordAuthentication no
  6. Install packages:
    • apt-get install wget gnupg man-db lsof strace tcpdump iptables-persistent rsync psmisc software-properties-common
    • apt-get install python2.7-minimal (for utilities in /root/bin)
  7. Ensure unattended upgrades are installed and enabled (dpkg-reconfigure unattended-upgrades)
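
For steps 4 and 5, the relevant sshd_config lines end up looking something like the following; the PermitRootLogin value shown is an assumption (key-only root login), so adjust to taste, and restart sshd afterwards:

# /etc/ssh/sshd_config (relevant lines only)
PermitRootLogin prohibit-password
UsePAM no
PasswordAuthentication no

systemctl restart ssh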

Install the Ceph software packages:

Refer to the manual installation procedures at Installation (Manual).

  1. wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
  2. apt-add-repository 'deb https://download.ceph.com/debian-quincy/ bullseye main'
  3. apt-get install ceph
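
A quick sanity check that the Quincy packages were picked up (the version reported should be a 17.2.x Quincy release):

ceph --version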

When first creating a Ceph cluster you need to do the following ONCE. Once the cluster is running no further bootstrapping is required. See further down for how to ADD a mon to an already-running cluster.

  1. Perform Monitor Bootstrapping (Monitor bootstrapping)
  2. You may need to run chown -R ceph:ceph /var/lib/ceph/mon/ceph-<host> to get the mon started the first time.
  3. systemctl enable ceph-mon@<host>
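
For reference, a heavily abbreviated sketch of the upstream monitor bootstrapping procedure referred to in step 1 (the linked page is authoritative; <host>, <ip> and <fsid> are placeholders):

uuidgen                 # generates the fsid used in /etc/ceph/ceph.conf
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
monmaptool --create --add <host> <ip> --fsid <fsid> /tmp/monmap
mkdir -p /var/lib/ceph/mon/ceph-<host>
ceph-mon --mkfs -i <host> --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring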

The initial /etc/ceph/ceph.conf file will look a bit like this:

[global]
fsid = db5b6a5a-1080-46d2-974a-80fe8274c8ba
mon initial members = storage00
mon host = 129.94.242.95

auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[mon.storage00]
host = storage00
mon addr = 129.94.242.95

Set up a mgr daemon:

See also Ceph-mgr administration guide.

  1. mkdir /var/lib/ceph/mgr/ceph-<host>
  2. ceph auth get-or-create mgr.<host> mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/ceph-<host>/keyring
  3. chown -R ceph:ceph /var/lib/ceph/mgr/ceph-<host>
  4. systemctl start ceph-mgr@<host>
  5. systemctl enable ceph-mgr@<host>

Some run-time configuration related to running mixed-version Ceph environments (which we hopefully don't do):

  1. ceph config set mon mon_warn_on_insecure_global_id_reclaim true
  2. ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed true
  3. ceph config set mon auth_allow_insecure_global_id_reclaim false
  4. ceph mon enable-msgr2

Check Ceph cluster-of-one-node status:

  1. ceph status

Ceph software and configuration common across all storage and compute nodes

In a disaster situation, the storage and compute nodes need to be able to run without depending on any additional network services. Thus site network configuration, necessary utility scripts and any other important files are duplicated on all nodes. Principally these are:

  1. /etc/hosts - Node network names and IP addresses.
  2. /etc/ceph/ceph.conf - Main Ceph configuration file. Used by all Ceph daemons and libraries, and very importantly by the mon daemons. When changed: restart the mon daemons (one at a time, not all at once! Keep an eye on ceph status while doing so).
  3. /etc/ceph/ceph.client.admin.keyring - Access key used by tools and utilities to grant administrative access to the cluster.
  4. /root/bin/ceph_* - CSG-provided utilities to maintain the cluster.
  5. /etc/iptables/rules.v4 - iptables/netfilter rules to protect the cluster. When changed: run /usr/sbin/netfilter-persistent start.
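
For illustration only, a minimal sketch of how these files could be pushed to every node. The real mechanism is the CSG-provided /root/bin/ceph_distribute_support_files, which may differ; the node names are those used in the examples further down this page:

#!/bin/bash
# Hypothetical sketch only; the real script is /root/bin/ceph_distribute_support_files.
FILES="/etc/hosts /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring /etc/iptables/rules.v4"
NODES="storage00 storage01 compute01"

for node in $NODES; do
    # copy the support files, preserving their full paths
    rsync -aR $FILES "root@$node:/"
    # keep the CSG utilities in /root/bin in sync as well
    rsync -a /root/bin/ceph_* "root@$node:/root/bin/"
done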

Adding additional storage nodes

  1. Follow the steps above for installing and configuring Linux up to using apt-get to install the Ceph packages.
  2. Update /root/bin/ceph_distribute_support_files and /etc/hosts on a running node to include the new node.
  3. On the same already-running node, run /root/bin/ceph_distribute_support_files.
  4. To add storage, select an unused OSD number and on the new node run:
    • ceph_create_quincy_osd /dev/<blockdevice> <osdnum>
    • systemctl start ceph-osd@<osdnum>
    • systemctl enable ceph-osd@<osdnum>
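
Once the new OSD is up it should appear under its host in the CRUSH tree and the cluster will start placing data on it. Quick checks, run from any node:

ceph osd tree
ceph status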

Add additional mon daemons

See also Adding/removing monitors.

Note: A maximum of one mon and one mgr are allowed per server/node.

# mkdir /var/lib/ceph/mon/ceph-<hostname>
# ceph auth get mon. -o /tmp/keyfile
# ceph mon getmap -o /tmp/monmap
# ceph-mon -i <hostname> --mkfs --monmap /tmp/monmap --keyring /tmp/keyfile
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-<hostname>
# systemctl start ceph-mon@<hostname>
# systemctl enable ceph-mon@<hostname>
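
Once the new mon has started, confirm from any node that it has joined the quorum:

# ceph mon stat
# ceph quorum_status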

Other Ceph configuration things

Set initial OSD weights

As all disks are the same size, use a weight of 1.0 until such time as other sizes are present.

Examples:

  1. ceph osd crush set osd.0 1.0 host=storage00
  2. ceph osd crush set osd.1 1.0 host=storage00
  3. ceph osd crush set osd.2 1.0 host=storage01
  4. ceph osd crush set osd.3 1.0 host=storage01
  5. ceph osd crush set osd.4 1.0 host=compute01
  6. ceph osd crush set osd.5 1.0 host=compute01

Set up the crushmap hierarchy to allow space to be allocated on a per-host basis (as opposed to per-OSD)

Not to be played with lightly. See CRUSH maps.

Example commands:

  1. bin/ceph_show_crushmap
  2. ceph osd crush move storage00 root=default
  3. ceph osd crush move storage01 root=default
  4. ceph osd crush move compute01 root=default

Create an RBD pool to be used for RBD devices (RADOS Block Devices)

  1. ceph osd pool create rbd 128 128
  2. rbd pool init rbd

Add MDS daemons for CephFS on two nodes (so we have two of them in case one fails)

Like this:

  1. mkdir -p /var/lib/ceph/mds/ceph-<hostname>
  2. ceph-authtool --create-keyring /var/lib/ceph/mds/ceph-<hostname>/keyring --gen-key -n mds.<hostname>
  3. ceph auth add mds.<hostname> osd "allow rwx" mds "allow *" mon "allow profile mds" -i /var/lib/ceph/mds/ceph-<hostname>/keyring
  4. chown -R ceph:ceph /var/lib/ceph/mds/ceph-<hostname>
  5. systemctl start ceph-mds@<hostname>
  6. systemctl status ceph-mds@<hostname>

Update /etc/ceph/ceph.conf and distribute.
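
The ceph.conf addition for each MDS is normally just a short section naming the host, mirroring the [mon.storage00] section shown earlier:

[mds.<hostname>]
host = <hostname>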

Create the transportable "vm" file system where the qemu scripts for each VM will go

  1. ceph osd pool create vmfs_data 128 128
  2. ceph osd pool create vmfs_metadata 128 128
  3. ceph fs new vm vmfs_metadata vmfs_data
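
Quick checks that the new file system exists and an MDS has gone active for it:

ceph fs ls
ceph mds stat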

Building a compute node

Note: Virtual machines run with their console available via VNC bound only to the loopback interface (127.0.0.1), so an SSH tunnel must be used to access the consoles (see the example below).
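
For example, to reach the console of a VM started with NUM=1 (see the start_vm scripts further down; NUM selects the VNC display, and display :1 is TCP port 5901) running on compute01, from a workstation:

ssh -L 5901:127.0.0.1:5901 root@compute01

and then point a VNC viewer at localhost:5901.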

  • Follow steps outlined above to do a basic setup of the server and install the Ceph software on it. Then,
  • Follow steps 1 thru 3 in Adding additional storage nodes (above).

Run:

  1. apt-get install qemu-system-x86 qemu-utils qemu-block-extra
  2. apt-get install bridge-utils

Set up /etc/network/interfaces. Use the following for inspiration on how to set up bridging:

# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

source /etc/network/interfaces.d/*

# The loopback network interface
auto lo
iface lo inet loopback

# The primary network interface
# allow-hotplug eth0
# iface eth0 inet static
auto br385
iface br385 inet static
	address 129.94.242.147/24
	gateway 129.94.242.1
	dns-nameservers 129.94.242.2
	dns-search cse.unsw.edu.au
	bridge_ports eth0

Configure qemu to allow virtual machines to use bridge devices

  1. mkdir /etc/qemu
  2. echo "allow br385" > /etc/qemu/bridge.conf

/root/bin/ceph_mount_qemu_machines

#!/bin/bash

# Mount the cluster's CephFS on /usr/local/qemu_machines using the client.admin
# key (the mon addresses come from /etc/ceph/ceph.conf), then show the usage.
mount -t ceph -o name=admin :/ /usr/local/qemu_machines && df -m

See /usr/local/qemu_machines.

/usr/local/qemu_machines/common/start_vm

#!/bin/bash
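# Common VM launcher, sourced by each VM's own start_vm wrapper (see the
# debianminimal example below). Behaviour is controlled by environment
# variables set by the wrapper before sourcing this file:
#   NUM       - VNC display number; console on 127.0.0.1, TCP port 5900+NUM
#   MEM       - RAM size (default 1G)
#   MAC       - first five octets of the MAC address; a per-NIC sixth octet is
#               appended (default: 56 followed by four random octets)
#   BOOT      - boot order (default "c", i.e. first hard disk)
#   NICE      - nice level for the qemu process (default 0)
#   CPU, KVM  - set KVM=kvm to enable KVM with CPU model $CPU (default "host")
#   SMP       - number of virtual CPUs (passed to -smp)
#   MACHINE   - qemu machine type (passed to -machine)
#   NODAEMON  - set non-empty to keep qemu in the foreground
#   DISK0...  - disk images or RBDs, attached in order as virtio drives
#   NETIF0... - bridge names; each becomes a virtio NIC on that bridge
#   EXTRAARGS - any extra qemu arguments (e.g. a -cdrom for installs)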

cd `dirname "$0"` || exit 1

if [ "a$MEM" = "a" ]; then MEM="1G"; fi
if [ "a$MAC" = "a" ]; then
	MAC="56"
	MAC="$MAC:`printf "%02x" $(( $RANDOM & 0xff ))`"
	MAC="$MAC:`printf "%02x" $(( $RANDOM & 0xff ))`"
	MAC="$MAC:`printf "%02x" $(( $RANDOM & 0xff ))`"
	MAC="$MAC:`printf "%02x" $(( $RANDOM & 0xff ))`"
#	MAC="$MAC:`printf "%02x" $(( $RANDOM & 0xff ))`"
fi
if [ "a$BOOT" = "a" ]; then BOOT="c"; fi
if [ "a$NICE" = "a" ]; then NICE="0"; fi
if [ "a$CPU" = "a" ]; then CPU="host"; fi
if [ "a$KVM" = "akvm" ]; then KVM="-enable-kvm -cpu $CPU"; else KVM=""; fi
if [ "a$SMP" != "a" ]; then SMP="-smp $SMP"; fi
if [ "a$MACHINE" != "a" ]; then MACHINE="-machine $MACHINE"; else MACHINE=""; fi
if [ "a$NODAEMON" = "a" ]; then DAEMONIZE="-daemonize"; else DAEMONIZE=""; fi

n=0
DISKS=""
while true; do
	eval d=\$DISK$n
	if [ "a$d" = "a" ]; then break; fi
	if [ "a$DISKS" != "a" ]; then DISKS="${DISKS} "; fi
	DISKS="${DISKS}-drive file=$d,if=virtio"
	n=$(( $n + 1 ))
done

n=0
NETIFS=""
while true; do
	eval d=\$NETIF$n
	if [ "a$d" = "a" ]; then break; fi
	if [ "a$NETIFS" != "a" ]; then NETIFS="${NETIFS} "; fi
	NETIFS="${NETIFS}-netdev bridge,id=hostnet$n,br=$d -device virtio-net-pci,netdev=hostnet$n,mac=$MAC:`printf "%02x" $n`"
	n=$(( $n + 1 ))
done

exec	/usr/bin/nice -n $NICE					\
	/usr/bin/qemu-system-x86_64				\
	-vnc		127.0.0.1:$NUM				\
			$DISKS					\
			$MACHINE				\
			$KVM					\
			$SMP					\
			$NETIFS					\
	-m		$MEM					\
	-boot		$BOOT					\
	-serial		none					\
	-parallel	none					\
			$EXTRAARGS				\
	$DAEMONIZE

exit

/usr/local/qemu_machines/debianminimal/start_vm

#!/bin/bash

#BOOT="d"
#EXTRAARGS="-cdrom /usr/local/qemu_machines/isos/debian-11.5.0-amd64-netinst.iso"

NUM=1
KVM="kvm"
DISK0="rbd:rbd/debianminimal-disk0,format=rbd"
NETIF0="br385"
#NODAEMON="y"

. /usr/local/qemu_machines/common/start_vm
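
For the initial install of such a VM one would typically create the backing RBD first and boot once from the installer ISO by uncommenting the BOOT and EXTRAARGS lines above; the 20G size here is only an example:

qemu-img create -f raw rbd:rbd/debianminimal-disk0 20G
/usr/local/qemu_machines/debianminimal/start_vm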

Some info about RADOS Block Devices

See Ceph block device.

RADOS Block Devices (RBD's) are network block devices created in a Ceph storage cluster which are available as block devices to clients of the cluster, typically to provide block storage -- e.g., boot disks, etc. -- to virtual machines running on hosts in the cluster.

RBD's are created in a Ceph storage pool which, when the pool is created, is associated with Ceph's internal "rbd" application. By convention this pool is typically called… <drum roll, please!>… "rbd".

The start_vm script above shows how to refer to an RBD when starting qemu.

Here are some useful commands to get you started:

  1. qemu-img create -f raw rbd:rbd/test-disk0 50G - Create an RBD using qemu-img which is 50G in size (this can be expanded later).
  2. ceph osd pool ls - Get a list of all Ceph storage pools.
  3. ceph osd pool ls detail - Get full details of all Ceph storage pools including replication sizes, striping, hashing, replica distribution policy, etc.
  4. rbd ls -l - Get a list of the RBD's and their sizes in the default RBD storage pool (namely "rbd").
  5. rbd -h - The start of a busy morning reading documentation.

Rebooting nodes - do so with care

The nodes in the storage cluster should not be rebooted more than one at a time. The data in the cluster is replicated across the nodes and a minimum number of replicas needs to be online at all times. Taking out more than one node at a time can (and usually does) mean that insufficient replicas of the data are available and the cluster will grind to a halt (including all the VMs relying on it).

It's not a major catastrophe (depending on how you define "major") when that happens as the cluster will regroup once the nodes come back online, but VMs depending on the cluster for, say, their root filesystem will die due to built-in Linux kernel timeouts after two minutes of no disk. In this case the VM invariably needs to be rebooted.

Ideally, you should follow this sequence when rebooting any node which contributes storage to the pool (i.e., storage00, storage01 and compute01):

  1. Pick a good time,
  2. Run ceph osd set noout on any node. This tells Ceph that there's going to be a temporary loss of a node and that it shouldn't start rebuilding the missing data on remaining nodes because the node is coming back soon,
  3. Run systemctl stop ceph-osd@<osdnum> on the node you're planning to reboot to stop each OSD running on the node, replacing <osdnum> with the relevant OSD numbers,
  4. Wait about ten seconds and then run ceph status on any node and make sure Ceph isn't having conniptions due to the missing OSD's. If it simply reports that the cluster is OK apart from some degradation then you're good to go. You may need to do this a few times until Ceph settles,
  5. Reboot the node,
  6. Keep an eye on ceph status on any other node; soon after the rebooted node comes back on line the degradation should go away and there'll just be the warning about the "noout" flag,
  7. When you've finished with the reboots, run ceph osd unset noout to clear the "noout" flag. ceph status should then say everything is OK.
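
Condensed into commands, the sequence looks like this (repeat the stop for each OSD hosted on the node being rebooted):

# on any node
ceph osd set noout
# on the node being rebooted
systemctl stop ceph-osd@<osdnum>
# from any node: expect only a "noout" warning plus some degraded PGs
ceph status
# reboot the node, wait for it to return and for ceph status to settle, then
ceph osd unset noout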