Probably the most frequently typed command in Ceph administration is ceph health, which reports the cluster's health status.
$ ceph health
HEALTH_OK
If the return value is anything other than HEALTH_OK, pay attention: some PGs may be in a state other than active+clean. For the problematic PGs, you can go a step further and run ceph health detail, which prints their details.
$ ceph health detail
HEALTH_OK
Of course, there are no problematic PGs in my cluster here, so both commands simply return OK.
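If you want to script this check instead of reading it by eye, a minimal sketch in Python could look like the following; it assumes the ceph CLI is in the PATH and the client keyring grants read access.

# Sketch: report cluster health and, if it is not HEALTH_OK, dump the details.
import subprocess

status = subprocess.check_output(["ceph", "health"]).decode().strip()
if status.startswith("HEALTH_OK"):
    print("cluster is healthy")
else:
    # Something is wrong; pull the per-PG details as well.
    detail = subprocess.check_output(["ceph", "health", "detail"]).decode()
    print(detail)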
You can run ceph -w to watch the cluster's events in real time, including INF (information), WRN (warning), and ERR (error) events.
$ ceph -w
    cluster 963a6787-0043-48e2-8677-a70f1564be17
     health HEALTH_OK
     monmap e1: 1 mons at {ceph2=172.17.6.176:6789/0}, election epoch 1, quorum 0 ceph2
     osdmap e64: 3 osds: 3 up, 3 in
      pgmap v84557: 384 pgs, 3 pools, 4879 MB data, 1599 objects
            120 GB used, 102 GB / 235 GB avail
                 384 active+clean

2015-12-03 10:20:14.713946 mon.0 [INF] pgmap v84557: 384 pgs: 384 active+clean; 4879 MB data, 120 GB used, 102 GB / 235 GB avail
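ceph -w keeps streaming until you interrupt it, so for scripted use you may only want to surface warnings and errors. A rough sketch, again assuming the ceph CLI is available:

# Sketch: follow ceph -w and print only WRN/ERR events.
import subprocess
import sys

proc = subprocess.Popen(["ceph", "-w"], stdout=subprocess.PIPE)
for raw in proc.stdout:
    line = raw.decode("utf-8", "replace")
    if "[WRN]" in line or "[ERR]" in line:
        sys.stdout.write(line)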
For the cluster's space usage statistics, run ceph df. The command shows the total size, available space, used space, and usage percentage, and it further breaks the statistics down per pool.
$ ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    235G     102G      120G         51.34
POOLS:
    NAME         ID     USED      %USED     MAX AVAIL     OBJECTS
    data         0      978       0         35073M        3
    metadata     1      0         0         35073M        0
    rbd          2      4879M     2.02      35073M        1596
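The same numbers are available in machine-readable form through the CLI's --format json option. The sketch below is only an illustration: the key names it reads (stats, total_bytes, total_used_bytes) are an assumption and differ between Ceph releases, so inspect the JSON on your version before relying on them.

# Sketch: alert when raw usage crosses a threshold (key names assumed, verify on your release).
import json
import subprocess

df = json.loads(subprocess.check_output(["ceph", "df", "--format", "json"]).decode())
stats = df["stats"]
pct_used = 100.0 * stats["total_used_bytes"] / stats["total_bytes"]
print("raw usage: %.2f%%" % pct_used)
if pct_used > 85:
    print("WARNING: cluster is getting full")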
Run the following command to check the cluster status:
$ ceph status
    cluster 963a6787-0043-48e2-8677-a70f1564be17
     health HEALTH_OK
     monmap e1: 1 mons at {ceph2=172.17.6.176:6789/0}, election epoch 1, quorum 0 ceph2
     osdmap e64: 3 osds: 3 up, 3 in
      pgmap v84561: 384 pgs, 3 pools, 4879 MB data, 1599 objects
            120 GB used, 102 GB / 235 GB avail
                 384 active+clean
Its output is identical to that of ceph -s.
Run the following command to list the cluster's authentication keys:
$ ceph auth list
installed auth entries:
...
To check the MON status and the monitor map, enter the following commands:
$ ceph mon stat
e1: 1 mons at {ceph2=172.17.6.176:6789/0}, election epoch 1, quorum 0 ceph2
$ ceph mon dump
dumped monmap epoch 1
epoch 1
fsid 963a6787-0043-48e2-8677-a70f1564be17
last_changed 0.000000
created 0.000000
0: 172.17.6.176:6789/0 mon.ceph2
Run the following command to check the cluster's quorum status; a cluster should always have a strict majority (more than half) of its MONs up and healthy.
$ ceph quorum_status|python -mjson.tool
{
    "election_epoch": 1,
    "monmap": {
        "created": "0.000000",
        "epoch": 1,
        "fsid": "963a6787-0043-48e2-8677-a70f1564be17",
        "modified": "0.000000",
        "mons": [
            {
                "addr": "172.17.6.176:6789/0",
                "name": "ceph2",
                "rank": 0
            }
        ]
    },
    "quorum": [
        0
    ],
    "quorum_leader_name": "ceph2",
    "quorum_names": [
        "ceph2"
    ]
}
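Because quorum_status already returns JSON, the majority rule above is easy to turn into a small check. Here is a sketch based on the fields shown in the output above (monmap.mons, quorum, quorum_leader_name):

# Sketch: verify that more than half of the monitors in the monmap are in quorum.
import json
import subprocess

qs = json.loads(subprocess.check_output(["ceph", "quorum_status"]).decode())
total = len(qs["monmap"]["mons"])
in_quorum = len(qs["quorum"])
print("%d of %d monitors are in quorum (leader: %s)"
      % (in_quorum, total, qs["quorum_leader_name"]))
if 2 * in_quorum <= total:
    print("WARNING: no majority of monitors")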
Run ceph osd tree to inspect the OSD tree:
$ ceph osd tree
# id    weight    type name        up/down reweight
-1      0.24      root default
-2      0.24              host ceph2
0       0.07999                   osd.0    up      1
1       0.07999                   osd.1    up      1
2       0.07999                   osd.2    up      1
It shows useful information about the OSDs, such as their weight, UP/DOWN status, IN/OUT status, and so on.
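Building on that output, the following sketch flags any OSD reported as down. It assumes the column layout shown above (id, weight, osd.N, up/down, reweight), which is not guaranteed on newer Ceph releases, where ceph osd tree prints additional columns.

# Sketch: scan the plain-text ceph osd tree output for OSDs marked down.
import subprocess

out = subprocess.check_output(["ceph", "osd", "tree"]).decode()
for line in out.splitlines():
    cols = line.split()
    # OSD rows look like: "<id> <weight> osd.<id> <up/down> <reweight>"
    if len(cols) >= 4 and cols[2].startswith("osd.") and cols[3] == "down":
        print("OSD reported down: %s" % cols[2])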
ceph osd dump is another very useful command. It prints the OSD map epoch and the pool details, including the pool ID, name, type, CRUSH ruleset, and PG count, and it also prints the ID, state, weight, and other information for every OSD.
$ ceph osd dump
epoch 64
fsid 963a6787-0043-48e2-8677-a70f1564be17
created 2015-10-28 13:52:53.131559
modified 2015-11-30 09:58:21.147863
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 43 flags hashpspool crash_replay_interval 45 stripe_width 0
	snap 1 'snapshot01' 2015-11-05 11:41:13.296489
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 41 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 45 flags hashpspool stripe_width 0
	removed_snaps [1~1]
max_osd 3
osd.0 up   in  weight 1 up_from 52 up_thru 61 down_at 51 last_clean_interval [49,50) 172.17.6.176:6811/59955 172.17.6.176:6812/59955 172.17.6.176:6813/59955 172.17.6.176:6814/59955 exists,up 3711263c-0898-4eac-aae1-3c316e8c6287
osd.1 up   in  weight 1 up_from 55 up_thru 61 down_at 54 last_clean_interval [8,50) 172.17.6.176:6805/59895 172.17.6.176:6806/59895 172.17.6.176:6807/59895 172.17.6.176:6808/59895 exists,up 6ef617b4-9dd0-4155-b2be-44bafa02f3d6
osd.2 up   in  weight 1 up_from 54 up_thru 61 down_at 53 last_clean_interval [22,50) 172.17.6.176:6800/59830 172.17.6.176:6801/59830 172.17.6.176:6802/59830 172.17.6.176:6803/59830 exists,up 117105d1-e101-498e-a837-eeb1b568716c
Run ceph osd crush dump to inspect the CRUSH map:
$ ceph osd crush dump
Its output is long and contains the complete view of the CRUSH hierarchy.
If the cluster contains a large number of OSDs, it can sometimes be hard to find where a given OSD sits in the CRUSH map. In that case the following command becomes useful:
$ ceph osd find 1|python -mjson.tool
{
    "crush_location": {
        "host": "ceph2",
        "root": "default"
    },
    "ip": "172.17.6.176:6805/59895",
    "osd": 1
}
The argument after find is the OSD's ID.
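To map every OSD to its host in one go, you can simply loop over the IDs. A sketch that assumes the three OSDs (0-2) of the cluster above, reading the crush_location.host field shown in the output:

# Sketch: print the CRUSH host of each OSD (IDs 0-2 assumed; adjust for your cluster).
import json
import subprocess

for osd_id in range(3):
    out = subprocess.check_output(["ceph", "osd", "find", str(osd_id)]).decode()
    info = json.loads(out)
    print("osd.%d lives on host %s" % (osd_id, info["crush_location"]["host"]))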
Besides command-line monitoring, there are also several open-source web dashboard tools, including Kraken, ceph-dash, and Calamari; they are not covered in detail here.