Ceph
Where to find docs
The official Ceph documentation is located on https://docs.ceph.com/en/latest/rados/operations/
It is strongly advised to use the documentation for the version being used.
- Pacific - https://docs.ceph.com/en/pacific/rados/operations/
- Quincy - https://docs.ceph.com/en/quincy/rados/operations/
- Reef - https://docs.ceph.com/en/reef/rados/operations/
Do not take information in the documentation at face value. Especially when it comes to advanced/rarely used/very new features it is strongly advised to test any claims made in the documentation about any particular feature.
Never assume that things will work as written without actually testing it on a test setup as close to your real workload scenario as possible.
Advice on Ceph releases
The current Ceph releases and their support status can be found on https://docs.ceph.com/en/latest/releases/
When a new Ceph stable version is released you are strongly advised to not roll it out on any production cluster whatsoever. Even though its listed as "stable" it doesn't mean that this is actually true. Especially avoid using .0 releases on anything remotely production unless you really, really now what you're doing and can live with a possible catastrophic failure.
Be very conservative about what version you run on production systems.
Shiny new features aren't worth the risk of total or partial data loss/corruption.
General maintenance
60 seconds cluster overview
The following commands can be used to quickly check the status of Ceph:
-
Print overall cluster status
ceph -s
-
Print detailed health information
ceph health detail
-
Display current OSD tree
ceph osd tree
-
Cluster storage usage by pool and storage class
ceph df
-
List pools with detailed configuration
ceph osd pool ls detail
-
Get usage stats for OSDs
ceph osd df {plain|tree} {class e.g. hdd|ssd}
-
Watch Ceph health messages sequentially
ceph -w
-
List daemon versions running in the cluster
ceph versions
Also you can run the following on each node running ceph-daemons, to provide further debug information about the environment:
# lscpu
# cat /proc/cpuinfo # if lscpu isn't available
# free -g
# ip l
# ethtool <device> # for each network adapter
Mute/Unmute a health warning
$ ceph health mute <what> <duration>
$ ceph health unmute <what>
Disable/Enable (deep-)scrubbing
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
$ ceph osd unset noscrub
$ ceph osd unset nodeep-scrub
Use this sparingly only in emergency situations. Setting these flags will cause a HEALTH_WARN status, increase risk of data corruption and also the risk of generating a HEALTH_WARN due to PGs not being (deep-)scrubbed in time.
Reboot a single node
The traditional way of doing this is by setting the noout
flag,
do the appropriate maintenance work and after the node is back online
unset the flag like so:
ceph osd set noout
After maintenance is done and host is back up:
ceph osd unset noout
On versions Luminous or above you can set the flag individually for single OSDs or entire CRUSH buckets, which can be a safer option in case of prolonged maintenance periods.
Add noout for a OSD:
ceph osd add-noout osd.<ID>
Remove noout for a OSD:
ceph osd rm-noout osd.<ID>
Add noout for CRUSH bucket (e.g. host name as seen in ceph osd tree
):
ceph osd set-group noout <crush-bucket-name>
Remove noout for CRUSH bucket:
ceph osd unset-group noout <crush-bucket-name>
Gathering information about block devices
Enumerate typical storage devices and LVM
# lsblk
# lsblk -S
# lsscsi
# nvme list
# pvs
# vgs
# lvs
SMART data for SATA/SAS and NVME devices
# smartctl -a /dev/sdX
# nvme smart-log /dev/nvmeXnY
Check format of a NVME device
# nvme id-ns -H /dev/nvmeXnY
Check the last lines named "LBA Format". It will show which formats are supported, which format is in use and which format offers the best performance according to the vendor.
Format a NVME device to a different LBA format using nvme-cli
This will destroy all data on the device!
# nvme format --lbaf=<id> /dev/nvmeXnY
Secure Erase a NVME drive using nvme-cli
This will destroy all data on the device!
# nvme format -s2 /dev/nvmeXnY
# blkdiscard /dev/nvmeXnY
# nvme format -s1 /dev/nvmeXnY
Secure Erase a SATA/SAS drive using hdparm
This will destroy all data on the device!
-
Gather device info:
# hdparm -I /dev/sdX
Check that the output says "not frozen" and "not locked", also it should list support for enhanced erase and list time estimates for SECURITY ERASE UNIT and/or ENHANCED SECURITY ERASE UNIT
-
Set a master password for the disk (required, will be automatically removed after wipe)
# hdparm --user-master wipeit --security-set-pass wipeit /dev/sdX
# hdparm -I /dev/sdXCheck that "Security level" is now "high" and master password is now "enabled" instead of "not enabled" before
-
Wipe the device
If device supports enhanced security erase (better), use the following:
# hdparm --user-master wipeit --security-erase-enhanced wipeit /dev/sdX
If not, use standard security erase:
# hdparm --user-master wipeit --security-erase wipeit /dev/sdX
On some systems the system firmware might "freeze" the device, which makes it impossible to issue a secure erase or reformat the device. In that case it might be necessary to either "unfreeze" the drive or to install the drive in another system where it can be unfrozen. Also make sure that the device is actually wiped. Its recommended to at least perform a blanking pass on HDDs with a tool like nwipe.
OSD maintenance tasks
Locate a specific OSD in the cluster
$ ceph osd find osd.<ID>
Get OSD metadata (global and single OSD)
$ ceph osd metadata
$ ceph osd metadata osd.<ID>
Interesting fields:
- bluefs_db_rotational
- bluefs_dedicated_db
- bluefs_dedicated_wal
- bluefs_wal_rotational
- bluestore_bdev_rotational
- device_ids
- device_paths
- devices
- hostname
- osd_objectstore
- rotational
Add a new OSD
-
Prepare the configuration for the new OSD first. Details on adding the configuration for a new OSD in the Ceph configuration guide.
-
Deploy the new OSD service on
<nodename>
.osism apply ceph-osds -l <nodename> -e ceph_handler_osds_restart=false
Replace a defect OSD
Remove a OSD
As with ‘Remove a single OSD node’. Except that the steps are only executed for a single OSD and the node is not removed from the CRUSH map and the inventory. Only the entries relating to the removed OSD are removed from the host vars.
Manual way
$ ceph osd crush reweight osd.<ID> 0.0
# Wait for rebalance to complete...
$ ceph osd out osd.<ID>
# systemctl stop ceph-osd@<ID>
# systemctl disable ceph-osd@<ID>
$ ceph osd purge osd.<ID> --yes-i-really-mean-it
The LV and VG defined in the inventory for this OSD must also be removed. The OSD itself should be wiped.
Remove a single OSD node
-
Get all OSDs of the node
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.11691 root default
-3 0.03897 host testbed-node-0
0 hdd 0.01949 osd.0 up 1.00000 1.00000
4 hdd 0.01949 osd.4 up 1.00000 1.00000
-5 0.03897 host testbed-node-1
1 hdd 0.01949 osd.1 up 1.00000 1.00000
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-7 0.03897 host testbed-node-2
2 hdd 0.01949 osd.2 up 1.00000 1.00000
5 hdd 0.01949 osd.5 up 1.00000 1.00000 -
Reduce the weighting of all OSDs on the node to 0. Do this for each OSD in a row and wait after each adjustment until the Ceph cluster is balanced. Depending on how large the Ceph cluster and the individual OSDs are, this may take some time.
$ ceph osd crush reweight osd.2 0.0
$ ceph osd crush reweight osd.5 0.0The Ceph OSDs that are to be removed then have a weight of 0.
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-3 0.03897 host testbed-node-0
0 hdd 0.01949 osd.0 up 1.00000 1.00000
4 hdd 0.01949 osd.4 up 1.00000 1.00000
-5 0.03897 host testbed-node-1
1 hdd 0.01949 osd.1 up 1.00000 1.00000
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-7 0 host testbed-node-2
2 hdd 0 osd.2 up 1.00000 1.00000
5 hdd 0 osd.5 up 1.00000 1.00000 -
Remove the OSDs and everything that belongs to them from the node. This is a disruptive action that cannot be undone. The devices used are also reset.
$ osism apply ceph-shrink-osd -e ireallymeanit=yes -e osd_to_kill=2,5
All OSDs were removed.
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.07794 root default
-3 0.03897 host testbed-node-0
0 hdd 0.01949 osd.0 up 1.00000 1.00000
4 hdd 0.01949 osd.4 up 1.00000 1.00000
-5 0.03897 host testbed-node-1
1 hdd 0.01949 osd.1 up 1.00000 1.00000
3 hdd 0.01949 osd.3 up 1.00000 1.00000
-7 0 host testbed-node-2 -
Remove the node from the CRUSH map.
$ ceph osd crush remove testbed-node-2
removed item id -7 name 'testbed-node-2' from crush map -
Remove the node from all Ceph groups in the inventory.
-
Remove all Ceph-specific parameters from the host vars of the node from the inventory
Remove an OSD (temporarily e.g. when replacing a broken disk)
$ ceph osd out osd.<ID>
# systemctl stop ceph-osd@<ID>
# systemctl disable ceph-osd@<ID>
Disable backfills/recovery completely
Use only in emergency situations!
$ ceph osd set nobackfill
$ ceph osd set norecovery
$ ceph osd set norebalance
Unset the flags with ceph osd unset <flag>
.
Rebalance OSDs
Placement Group maintenance
Dump placement groups
Usually only useful when parsing it, so here are two ways to get the data:
$ ceph pg dump
$ ceph pg dump --format=json-pretty
Query a PG about its status
$ ceph pg <pgid> query
Start (deep-)scrubbing of a placement group
$ ceph pg scrub <pgid>
$ ceph pg deep-scrub <pgid>
Instructing a PG to (deep-)scrub does not mean that it will do so immediately, it can take some time for the scrub to start.
HEALTH_WARN - Large omap objects found...
Finding PGs which have large OMAP objects:
# ceph pg dump --format=json | jq '.pg_map.pg_stats[] |
select(.stat_sum.num_large_omap_objects != 0) |
(.pgid, .stat_sum.num_large_omap_objects, .up, .acting)'
(Remove the line breaks between the single quotes or jq
might act weird!)
This will dump all PG IDs with large OMAP objects and their up/acting OSDs. You then can grep the logs of these OSDs for "Large omap object" to find the actual objects causing the health warning.
Also the PG ID before the dot is equal to the pool ID it belongs to.
In case the logs have been rotated, instruct those OSDs to do a deep-scrub and watch the logs for the message to appear.
From there you can investigate the issue further, mostly it'll be due to the index of a RGW bucket getting too big due to too many objects, thus resharding that bucket's index will be necessary.
Instruct a PG to repair in case of scrub errors (inconsistent PG)
$ ceph pg repair <pgid>
Recovery might not start immediately and might take some time.
You can query the status of the recovery through ceph pg <pgid> query
.
Be sure to read the Ceph manual about this topic thoroughly:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/
RADOS Pool maintenance
Read the RADOS pool operations documentation in detail before playing around with pools. Especially when considering making changes to the CRUSH map. Wrong decisions there can lead to data loss or other catastrophic failures.
Get pools and their configuration
$ ceph osd pool ls detail
Dump all CRUSH rules
$ ceph osd crush rule dump
Get autoscaler status
$ ceph osd pool autoscale-status
Create a replicated pool
$ ceph osd pool create <pool_name> <pg_num> <pgp_num> replicated [<crush_rule_name>]
Enabling an application on a pool
Required, otherwise a health warning will be raised after some time.
$ ceph osd pool application enable <pool_name> <application_name> # Syntax
$ ceph osd pool application enable cinder rbd # Example
Typical application names are: rbd, rgw, cephfs
Delete a pool
This will delete all data in that pool. There is no undo/undelete.
$ ceph osd pool delete <pool_name> <pool_name> --yes-i-really-really-mean-it
In order to be able to delete pools, it has to be enabled on the monitors
by setting the mon_allow_pool_delete
flag to true. Default is false.
See: https://docs.ceph.com/en/latest/rados/configuration/mon-config-ref
Set number of PGs for a pool
If no autoscaling of PGs is used, it is very important to adapt the PGs per pool to the real world when operating a Ceph cluster. If, for example, OSDs are exchanged, added, new nodes are added, etc., the number of PGs must also be taken into account.
The PG Calc Tool can be used to calculate a reasonable number of PGs per pool depending on all ODSs and pools.
Further information on placement groups can be found in the Ceph documentation. You should definitely read FACTORS RELEVANT TO SPECIFYING PG_NUM and CHOOSING THE NUMBER OF PGS there.
$ ceph osd pool set <poolname> pg_num <num_pgs>
Num PGs must be a power of two! Be careful about changing number of PGs. Changing pg_num to a new value will gradually increase pgp_num on newer versions of Ceph.
In older versions one also has to set pgp_num manually, either in increments or in one big leap.
Create CRUSH rules for different storage classes
$ ceph osd crush rule create-replicated replicated_hdd default host hdd
$ ceph osd crush rule create-replicated replicated_ssd default host ssd
$ ceph osd crush rule create-replicated replicated_nvme default host nvme
Change CRUSH rule for a pool ("move pool")
$ ceph osd pool set <poolname> crush_rule <rule_name>
This can be used to move a pool from e.g. HDD to SSD or NVME class or anything else that the new CRUSH rule specifies.
Advanced topics
Validating Ceph using OSISM playbooks
For Ceph, special playbooks were added to validate the deployment status of
the OSD, MON and MGR services. The commands for use are osism validate ceph-osds
,
osism validate ceph-mons
, and osism validate ceph-mgrs
.
These playbooks will validate that the deployed Ceph environment matches
the configuration and is overall in a healthy state. The playbooks will
generate report files in JSON format on the first manager node in /opt/reports/validator
.
Shutdown a Ceph cluster
In order to fully shutdown a Ceph cluster safely, you first do the following steps:
Take GOOD NOTES of the unit names and OSD IDs running on each node. You will need them to restart the cluster later.
-
Stop the workload that is using the cluster
This will vary depending on your environment and is not covered here.
-
Pause/Stop operations on the cluster by setting flags
$ ceph osd set noout
$ ceph osd set nobackfill
$ ceph osd set norecover
$ ceph osd set norebalance
$ ceph osd set nodown
$ ceph osd set pause -
Stop and disable the
radosgw
services on all nodes (on each rgw node) (if RGW is used)Get the name of the unit (globs not supported for disable) and make a note of the unit name for that node:
# systemctl | grep ceph-radosgw
Then disable and stop the unit:
# systemctl disable --now ceph-radosgw@<name>.service
-
Stop all CephFS file systems (if CephFS is used)
List all Ceph file systems
$ ceph fs ls
For each CephFS do:
$ ceph fs <file system name> down true
-
After that disable and stop all
ceph-mds
services on all nodes (do this on each node)Get the name of the unit (globs not supported for disable) and make a note of the unit name for that node:
# systemctl | grep ceph-mds
# systemctl disable --now ceph-mds@<unit>.service
-
Stop and disable the
ceph-mgr
services on all nodes (do this on each node)Get the name of the unit (globs not supported for disable) and make a note of the unit name for that node:
# systemctl | grep ceph-mgr
# systemctl disable --now ceph-mgr@<unit>.service
-
Stop and disable the
ceph-osd
services on all nodes (do this on each node)Get the names of the units (globs not supported for disable) and make a note of the unit names for that node (best to save it to a file):
# systemctl | grep ceph-osd
For each OSD unit execute:
# systemctl disable ceph-osd@<osd-id>.service
Stop all OSDs at once:
# systemctl stop ceph-osd\*.service
-
Finally stop the
ceph-mon
services on all nodes (do this on each node)Get the name of the unit (globs not supported for disable) and make a note of the unit name for that node:
# systemctl | grep ceph-mon
# systemctl disable --now ceph-mon@<unit>.service
Restart a Ceph cluster after manual shutdown
You will need the notes taken during shutdown of the unit names. It can be done without, but then it'll be way more work finding out the names.
In order to restart a Ceph cluster after performing a manual shutdown like described in the section above, you do the following:
-
Enable & start the
ceph-mon
services on all nodes (do this on each node)# systemctl enable --now ceph-mon@<unit-name>.service
-
Enable & start the
ceph-osd
services on all nodes (do this on each node)For each Ceph OSD on that node do:
# systemctl enable --now ceph-osd@<osd-id>.service
Depending on the number of OSDs on that node it can take a while.
-
Enable & start the
ceph-mgr
services on all nodes (do this on each node)# systemctl enable --now ceph-mgr@<unit-name>.service
-
Check the status of your cluster and wait for all OSDs to come online
You can watch the status periodically by running:
$ watch ceph -s
You should wait until all OSDs are up + in again, before removing flags.
-
Remove flags to unpause operations
$ ceph osd unset pause
$ ceph osd unset nodown
$ ceph osd unset noout
$ ceph osd unset nobackfill
$ ceph osd unset norecover
$ ceph osd unset norebalance -
Wait for cluster to resume operations
See step #4 of this SOP. Now you wait until the cluster seems "happy enough" to accept clients. (i.e. rebalancing finished etc.) Maybe it will complain about MDS being down, but that's normal for now.
-
Enable & start the
ceph-mds
services on each node (if CephFS is used)# systemctl enable --now ceph-mds@<unit>.service
-
Start CephFS file systems again
List all Ceph file systems
$ ceph fs ls
For each CephFS do:
$ ceph fs <file system name> down false
-
Enable & start the
radosgw
services on each node (if RGW is used)# systemctl enable --now ceph-radosgw@<name>.service
Performance benchmark
# apt-get install -y fio
#!/usr/bin/env bash
BENCH_DEVICE="$2"
DATE=$(date +%s)
IOENGINE="libaio"
LOGPATH="$1"
SIZE=1G
mkdir -p $LOGPATH
for RW in "write" "randwrite" "read" "randread"
do
for BS in "4K" "64K" "1M" "4M" "16M" "64M"
do
(
echo "==== $RW - $BS - DIRECT ===="
echo 3 > /proc/sys/vm/drop_caches
fio --rw=$RW --ioengine=${IOENGINE} --size=$SIZE --bs=$BS --direct=1 --runtime=60 --time_based --name=bench --filename=$BENCH_DEVICE --output=$LOGPATH/$RW.${BS}-direct-$(basename $BENCH_DEVICE).$DATE.log.json --output-format=json
sync
echo 3 > /proc/sys/vm/drop_caches
echo "==== $RW - $BS - DIRECT IODEPTH 32 ===="
fio --rw=$RW --ioengine=${IOENGINE} --size=$SIZE --bs=$BS --iodepth=32 --direct=1 --runtime=60 --time_based --name=bench --filename=$BENCH_DEVICE --output=$LOGPATH/$RW.${BS}-direct-iod32-$(basename $BENCH_DEVICE).$DATE.log.json --output-format=json
sync
) | tee $LOGPATH/$RW.$BS-$(basename $BENCH_DEVICE).$DATE.log
echo
done
done
Where and how to get further help
Join the #ceph IRC channel on irc.oftc.net, state the problem with as many details as possible including information about what steps have already been taken to solve the problem also provide information from the command output from the "60 seconds cluster overview" above through a pastebin or a similar service. In order for people to be able to help, details and some patience are important.