A collection of notes on Ceph

Ceph is a very impressive storage system which allows many different kinds of access to the storage resources available in a cluster. I considered using it for my next NAS solution, and at the very least I wanted to be able to help people with massive storage needs set up such a system. I started out with the old suite of tools for setting up a Ceph cluster and then moved on to the newer, less mature suite. In the end I do not intend to use Ceph myself, partly because bugs are too prevalent for my taste, but also because I am doing just fine with my 3TB storage array running ZFS. The following is a semi-random mess of notes I've made along the way.

Update 2015-01-31: I've now set up Ceph using a Chef cookbook of my own design (I couldn't get the community cookbook to work), which works a lot better, but I still get into weird situations where Ceph just doesn't work. I'm no longer concerned about data persistence, but data availability is still dodgy. When things work as intended I'm sure it offers very high availability, but when it's stuck in "PC Load Letter" state for two hours while I try to coddle it into working, I'm not happy. I am, however, still convinced that Ceph has more potential than GlusterFS, so I'll keep wrestling with it.

Installing Ceph the old way
Installing Ceph the new way
Ceph summary

Installing Ceph the old way

We need to install the ceph and ceph-mds packages, and maybe ceph-fs as well, and we need to allow passwordless root login between the nodes. If an OSD folder must be erased, the btrfs subvolumes inside it have to be deleted first with btrfs subvolume delete /srv/osd.0/* (a plain rm -rf fails, as shown below).
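
Roughly, the per-node preparation might look like this (package names as found in the Ubuntu repositories at the time; the public key file name is just a placeholder for whatever key the admin node uses):

# install the Ceph daemons on every node
sudo apt-get install ceph ceph-mds

# allow the admin node to log in as root without a password
sudo mkdir -p /root/.ssh
cat /tmp/admin_id_rsa.pub | sudo tee -a /root/.ssh/authorized_keys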

cjp@ubuntu1:~$ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ubuntu1.keyring
temp dir is /tmp/mkcephfs.iOufDzzFhs
preparing monmap in /tmp/mkcephfs.iOufDzzFhs/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.141:6789 --add b 192.168.0.142:6789 --add c 192.168.0.143:6789 --print /tmp/mkcephfs.iOufDzzFhs/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.iOufDzzFhs/monmap
/usr/bin/monmaptool: generated fsid 1b9d71d1-3a70-4701-8e1a-d230304040a4
epoch 0
fsid 1b9d71d1-3a70-4701-8e1a-d230304040a4
last_changed 2013-05-03 20:13:46.538059
created 2013-05-03 20:13:46.538059
0: 192.168.0.141:6789/0 mon.a
1: 192.168.0.142:6789/0 mon.b
2: 192.168.0.143:6789/0 mon.c
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.iOufDzzFhs/monmap (3 monitors)
=== osd.0 ===
2013-05-03 20:13:46.800957 7fdacabed7c0 -1 provided osd id 0 != superblock's -1
2013-05-03 20:13:46.802032 7fdacabed7c0 -1  ** ERROR: error creating empty object store in /srv/osd.0: (22) Invalid argument
failed: '/sbin/mkcephfs -d /tmp/mkcephfs.iOufDzzFhs --init-daemon osd.0'
cjp@ubuntu1:~$ ls /srv/
osd.0  osd.0.journal
cjp@ubuntu1:~$ ls /srv/osd.0
ceph_fsid  current  fsid  magic  ready  snap_1  snap_2  store_version  whoami
cjp@ubuntu1:~$ sudo chown -R root:root /srv/
cjp@ubuntu1:~$ sudo chown -R root:root /etc/ceph/
cjp@ubuntu1:~$ sudo rm -rf /srv/osd.0/*
rm: cannot remove ‘/srv/osd.0/current’: Operation not permitted
rm: cannot remove ‘/srv/osd.0/snap_1’: Operation not permitted
rm: cannot remove ‘/srv/osd.0/snap_2’: Operation not permitted
cjp@ubuntu1:~$ sudo btrfs subvolume delete /srv/osd.0/*
Delete subvolume '/srv/osd.0/current'
Delete subvolume '/srv/osd.0/snap_1'
Delete subvolume '/srv/osd.0/snap_2'
cjp@ubuntu1:~$ sudo rm -rf /srv/osd.0
osd.0/         osd.0.journal
cjp@ubuntu1:~$ sudo rm -rf /srv/osd.0
osd.0/         osd.0.journal
cjp@ubuntu1:~$ sudo rm -rf /srv/osd.0*
cjp@ubuntu1:~$ sudo mkdir /srv/osd.0
cjp@ubuntu1:~$ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ubuntu1.keyring
temp dir is /tmp/mkcephfs.UTg2NevMRI
preparing monmap in /tmp/mkcephfs.UTg2NevMRI/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.141:6789 --add b 192.168.0.142:6789 --add c 192.168.0.143:6789 --print /tmp/mkcephfs.UTg2NevMRI/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.UTg2NevMRI/monmap
/usr/bin/monmaptool: generated fsid 121233d3-aa27-4d08-afe2-be6d437fbfe5
epoch 0
fsid 121233d3-aa27-4d08-afe2-be6d437fbfe5
last_changed 2013-05-03 20:15:37.487261
created 2013-05-03 20:15:37.487261
0: 192.168.0.141:6789/0 mon.a
1: 192.168.0.142:6789/0 mon.b
2: 192.168.0.143:6789/0 mon.c
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.UTg2NevMRI/monmap (3 monitors)
=== osd.0 ===
2013-05-03 20:15:37.706965 7f98327ba7c0 -1 filestore(/srv/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-05-03 20:15:37.792886 7f98327ba7c0 -1 created object store /srv/osd.0 journal /srv/osd.0.journal for osd.0 fsid 121233d3-aa27-4d08-afe2-be6d437fbfe5
2013-05-03 20:15:37.793222 7f98327ba7c0 -1 already have key in keyring /etc/ceph/osd.0.keyring
=== osd.1 ===
pushing conf and monmap to ubuntu2:/tmp/mkfs.ceph.7dd4d53e5445f1458620029224af8605
2013-05-03 20:15:51.490093 7fc993f4b7c0 -1 filestore(/srv/osd.1) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-05-03 20:15:51.614964 7fc993f4b7c0 -1 created object store /srv/osd.1 journal /srv/osd.1.journal for osd.1 fsid 121233d3-aa27-4d08-afe2-be6d437fbfe5
2013-05-03 20:15:51.616163 7fc993f4b7c0 -1 auth: error reading file: /etc/ceph/osd.1.keyring: can't open /etc/ceph/osd.1.keyring: (2) No such file or directory
2013-05-03 20:15:51.616394 7fc993f4b7c0 -1 created new key in keyring /etc/ceph/osd.1.keyring
collecting osd.1 key
=== osd.2 ===
pushing conf and monmap to ubuntu3:/tmp/mkfs.ceph.60551ae62782010474495f047478f7c8
2013-05-03 20:15:56.475126 7f4fc61c37c0 -1 filestore(/srv/osd.2) limited size xattrs -- enable filestore_xattr_use_omap
2013-05-03 20:15:56.475442 7f4fc61c37c0 -1 OSD::mkfs: couldn't mount FileStore: error -95
2013-05-03 20:15:56.475589 7f4fc61c37c0 -1  ** ERROR: error creating empty object store in /srv/osd.2: (95) Operation not supported
failed: 'ssh root@ubuntu3 /sbin/mkcephfs -d /tmp/mkfs.ceph.60551ae62782010474495f047478f7c8 --init-daemon osd.2'
cjp@ubuntu1:~$ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ubuntu1.keyring
temp dir is /tmp/mkcephfs.mPL6lFag7W
preparing monmap in /tmp/mkcephfs.mPL6lFag7W/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.141:6789 --add b 192.168.0.142:6789 --add c 192.168.0.143:6789 --print /tmp/mkcephfs.mPL6lFag7W/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.mPL6lFag7W/monmap
/usr/bin/monmaptool: generated fsid 9741d923-05a5-4910-ad82-1196dc6ba900
epoch 0
fsid 9741d923-05a5-4910-ad82-1196dc6ba900
last_changed 2013-05-03 20:17:42.288670
created 2013-05-03 20:17:42.288670
0: 192.168.0.141:6789/0 mon.a
1: 192.168.0.142:6789/0 mon.b
2: 192.168.0.143:6789/0 mon.c
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.mPL6lFag7W/monmap (3 monitors)
=== osd.0 ===
2013-05-03 20:17:42.492156 7fbd62b407c0 -1 provided osd id 0 != superblock's -1
2013-05-03 20:17:42.493175 7fbd62b407c0 -1  ** ERROR: error creating empty object store in /srv/osd.0: (22) Invalid argument
failed: '/sbin/mkcephfs -d /tmp/mkcephfs.mPL6lFag7W --init-daemon osd.0'
cjp@ubuntu1:~$ sudo rm -rf /srv/osd.0/*
rm: cannot remove ‘/srv/osd.0/current’: Operation not permitted
rm: cannot remove ‘/srv/osd.0/snap_1’: Operation not permitted
rm: cannot remove ‘/srv/osd.0/snap_2’: Operation not permitted
cjp@ubuntu1:~$ sudo rm -rf /srv/osd.0*
rm: cannot remove ‘/srv/osd.0/snap_1’: Operation not permitted
rm: cannot remove ‘/srv/osd.0/snap_2’: Operation not permitted
rm: cannot remove ‘/srv/osd.0/current’: Operation not permitted
cjp@ubuntu1:~$ sudo btrfs subvolume delete /srv/osd.0/*
Delete subvolume '/srv/osd.0/current'
Delete subvolume '/srv/osd.0/snap_1'
Delete subvolume '/srv/osd.0/snap_2'
cjp@ubuntu1:~$ sudo mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/ubuntu1.keyring
temp dir is /tmp/mkcephfs.tN9lNysECU
preparing monmap in /tmp/mkcephfs.tN9lNysECU/monmap
/usr/bin/monmaptool --create --clobber --add a 192.168.0.141:6789 --add b 192.168.0.142:6789 --add c 192.168.0.143:6789 --print /tmp/mkcephfs.tN9lNysECU/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.tN9lNysECU/monmap
/usr/bin/monmaptool: generated fsid 6340b71e-0a4c-4c45-969b-78444b05d33b
epoch 0
fsid 6340b71e-0a4c-4c45-969b-78444b05d33b
last_changed 2013-05-03 20:20:23.289217
created 2013-05-03 20:20:23.289217
0: 192.168.0.141:6789/0 mon.a
1: 192.168.0.142:6789/0 mon.b
2: 192.168.0.143:6789/0 mon.c
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.tN9lNysECU/monmap (3 monitors)
=== osd.0 ===
2013-05-03 20:20:23.539249 7f886eb897c0 -1 filestore(/srv/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-05-03 20:20:23.585450 7f886eb897c0 -1 created object store /srv/osd.0 journal /srv/osd.0.journal for osd.0 fsid 6340b71e-0a4c-4c45-969b-78444b05d33b
2013-05-03 20:20:23.586106 7f886eb897c0 -1 already have key in keyring /etc/ceph/osd.0.keyring
=== osd.1 ===
pushing conf and monmap to ubuntu2:/tmp/mkfs.ceph.72127913a8ecc40028e107fef464500f
2013-05-03 20:20:26.763272 7f5e6437c7c0 -1 filestore(/srv/osd.1) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-05-03 20:20:26.811534 7f5e6437c7c0 -1 created object store /srv/osd.1 journal /srv/osd.1.journal for osd.1 fsid 6340b71e-0a4c-4c45-969b-78444b05d33b
2013-05-03 20:20:26.880487 7f5e6437c7c0 -1 already have key in keyring /etc/ceph/osd.1.keyring
collecting osd.1 key
=== osd.2 ===
pushing conf and monmap to ubuntu3:/tmp/mkfs.ceph.f2eaab3020550c0abc3f15faa3bac8dc
2013-05-03 20:20:41.133773 7fe0c4ec47c0 -1 filestore(/srv/osd.2) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-05-03 20:20:41.218473 7fe0c4ec47c0 -1 created object store /srv/osd.2 journal /srv/osd.2.journal for osd.2 fsid 6340b71e-0a4c-4c45-969b-78444b05d33b
2013-05-03 20:20:41.272328 7fe0c4ec47c0 -1 auth: error reading file: /etc/ceph/osd.2.keyring: can't open /etc/ceph/osd.2.keyring: (2) No such file or directory
2013-05-03 20:20:41.272597 7fe0c4ec47c0 -1 created new key in keyring /etc/ceph/osd.2.keyring
collecting osd.2 key
=== mds.a ===
creating private key for mds.a keyring /etc/ceph/mds.a.keyring
creating /etc/ceph/mds.a.keyring
Building generic osdmap from /tmp/mkcephfs.tN9lNysECU/conf
/usr/bin/osdmaptool: osdmap file '/tmp/mkcephfs.tN9lNysECU/osdmap'
/usr/bin/osdmaptool: writing epoch 1 to /tmp/mkcephfs.tN9lNysECU/osdmap
Generating admin key at /tmp/mkcephfs.tN9lNysECU/keyring.admin
creating /tmp/mkcephfs.tN9lNysECU/keyring.admin
Building initial monitor keyring
added entity mds.a auth auth(auid = 18446744073709551615 key=AQD//4NRkEBJIhAAGQ5jRxKbGACOLIHwA4/HSw== with 0 caps)
added entity osd.0 auth auth(auid = 18446744073709551615 key=AQDn9INRwML4ARAAgtTz/Juy1Ba4JNaNq2dfJA== with 0 caps)
added entity osd.1 auth auth(auid = 18446744073709551615 key=AQDX/oNRyAG8JBAA7ZOLFCTZKdoXxubIn+sVmw== with 0 caps)
added entity osd.2 auth auth(auid = 18446744073709551615 key=AQD5/4NRIAU+EBAACttE+AVbFA03PD+851DW4A== with 0 caps)
=== mon.a ===
/usr/bin/ceph-mon: created monfs at /srv/mon.a for mon.a
=== mon.b ===
pushing everything to ubuntu2
/usr/bin/ceph-mon: created monfs at /srv/mon.b for mon.b
=== mon.c ===
pushing everything to ubuntu3
/usr/bin/ceph-mon: created monfs at /srv/mon.c for mon.c
placing client.admin keyring in /etc/ceph/ubuntu1.keyring

Starting up the cluster.

root@ubuntu2:~# service ceph status
=== mon.b ===
mon.b: not running.
=== osd.1 ===
osd.1: not running.
root@ubuntu2:~# service ceph start
=== mon.b ===
Starting Ceph mon.b on ubuntu2...
starting mon.b rank 1 at 192.168.0.142:6789/0 mon_data /srv/mon.b fsid 6340b71e-0a4c-4c45-969b-78444b05d33b
=== osd.1 ===
Starting Ceph osd.1 on ubuntu2...
starting osd.1 at :/0 osd_data /srv/osd.1 /srv/osd.1.journal
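
The same init script can also start or query the daemons on every host listed in ceph.conf from a single node by adding the -a flag; a sketch from memory of the mkcephfs-era tooling:

sudo service ceph -a start
sudo service ceph -a status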

Have to use sudo to access the keyrings.

cjp@ubuntu1:~$ ls /etc/ceph/
ceph.conf  mds.a.keyring  osd.0.keyring  ubuntu1.keyring
cjp@ubuntu1:~$ ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health
2013-05-03 21:10:08.526903 7f1f7df21780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2013-05-03 21:10:08.526928 7f1f7df21780 -1 ceph_tool_common_init failed.
cjp@ubuntu1:~$ sudo ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health
HEALTH_OK
cjp@ubuntu1:~$

Let's try to see how the data has been distributed.

cjp@ubuntu1:~$ du -sh /srv/osd.0
6.3M    /srv/osd.0
cjp@ubuntu1:~$ sudo ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health
[sudo] password for cjp:
HEALTH_WARN 1 mons down, quorum 0,2 a,c
cjp@ubuntu1:~$ du -sh /srv/osd.0
6.3M    /srv/osd.0
cjp@ubuntu1:~$ du -sh /srv/osd.0
6.3M    /srv/osd.0
cjp@ubuntu1:~$ du -sh /srv/osd.0
osd.0/         osd.0.journal
cjp@ubuntu1:~$ du -sh /srv/osd.0
6.3M    /srv/osd.0
cjp@ubuntu1:~$ du -sh /srv/
du: cannot read directory ‘/srv/mon.a’: Permission denied
1007M   /srv/
cjp@ubuntu1:~$ sudo ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health
HEALTH_WARN 496 pgs degraded; 493 pgs stuck unclean; recovery 953/2928 degraded (32.548%)
cjp@ubuntu1:~$ du -sh /srv/
du: cannot read directory ‘/srv/mon.a’: Permission denied
1.5G    /srv/
cjp@ubuntu1:~$ du -sh /srv/osd.0
437M    /srv/osd.0
cjp@ubuntu1:~$ du -sh /srv/osd.0
439M    /srv/osd.0
cjp@ubuntu1:~$ du -sh /srv/osd.0
439M    /srv/osd.0
cjp@ubuntu1:~$ sudo ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health
HEALTH_WARN 12 pgs recovering; 136 pgs recovery_wait; 145 pgs stuck unclean; recovery 1577/3768 degraded (41.852%);  recovering 23 o/s, 3547KB/s
cjp@ubuntu1:~$
cjp@ubuntu1:~$ du -sh /srv/osd.0
471M    /srv/osd.0
cjp@ubuntu1:~$ sudo ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health
HEALTH_OK
cjp@ubuntu1:~$
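
Rather than hammering du and ceph health by hand, a simple watch loop (same keyring and paths as above) shows the rebalancing progress:

watch -n 10 'sudo ceph -k /etc/ceph/ubuntu1.keyring -c /etc/ceph/ceph.conf health; du -sh /srv/osd.0'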

Installing Ceph the new way

First we add a special user for Ceph:

ssh user@ceph-server
sudo useradd -d /home/ceph -m ceph
sudo passwd ceph

Next give that user full privileges:

echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
sudo chmod 0440 /etc/sudoers.d/ceph

If you don't already have a key pair set up:

ssh-keygen
Generating public/private key pair.
Enter file in which to save the key (/ceph-client/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /ceph-client/.ssh/id_rsa.
Your public key has been saved in /ceph-client/.ssh/id_rsa.pub.
cat .ssh/id_rsa.pub >> .ssh/authorized_keys

Then distribute these keys to allow passwordless login for the new user; one way to do that is sketched below. After that, let's try to install Ceph on a single node!
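
ssh-copy-id does the distribution nicely (the node names are the ones used in this setup):

ssh-copy-id ceph@ceph1
ssh-copy-id ceph@ceph2
ssh-copy-id ceph@ceph3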

ceph@ceph1:~/ceph-deploy$ ./ceph-deploy new ceph1
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy install ceph1
OK
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy mon create ceph1
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph1
/dev/sdb :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ sudo fdisk -l /dev/sda
Disk /dev/sda: 19.3 GB, 19327352832 bytes
255 heads, 63 sectors/track, 2349 cylinders, total 37748736 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0006f873
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      499711      248832   83  Linux
/dev/sda2          501758    37746687    18622465    5  Extended
/dev/sda5          501760    37746687    18622464   8e  Linux LVM
ceph@ceph1:~/ceph-deploy$ sudo fdisk -l /dev/sdb
Disk /dev/sdb: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdb doesn't contain a valid partition table
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk zap ceph1:sdb
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk prepare ceph1:sdb:/srv/osd1
Traceback (most recent call last):
  File "./ceph-deploy", line 9, in <module>
    load_entry_point('ceph-deploy==0.1', 'console_scripts', 'ceph-deploy')()
  File "/home/ceph/ceph-deploy/ceph_deploy/cli.py", line 112, in main
    return args.func(args)
  File "/home/ceph/ceph-deploy/ceph_deploy/osd.py", line 440, in disk
    prepare(args, cfg, activate_prepared_disk=False)
  File "/home/ceph/ceph-deploy/ceph_deploy/osd.py", line 217, in prepare
    key = get_bootstrap_osd_key(cluster=args.cluster)
  File "/home/ceph/ceph-deploy/ceph_deploy/osd.py", line 27, in get_bootstrap_os
d_key
    raise RuntimeError('bootstrap-osd keyring not found; run \\'gatherkeys\\'')
RuntimeError: bootstrap-osd keyring not found; run 'gatherkeys'
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy gatherkeys ceph1
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk prepare ceph1:sdb:/srv/osd1
ceph@ceph1:~/ceph-deploy$ sudo fdisk -l /dev/sdb
WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.
Disk /dev/sdb: 8589 MB, 8589934592 bytes
256 heads, 63 sectors/track, 1040 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1    16777215     8388607+  ee  GPT
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy osd activate ceph1:sdb1
ceph@ceph1:~/ceph-deploy$ ls /srv/
osd1
ceph@ceph1:~/ceph-deploy$ ls /srv/osd1
/srv/osd1
ceph@ceph1:~/ceph-deploy$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu1304--vg-root   17G  6.4G  9.7G  40% /
none                             4.0K     0  4.0K   0% /sys/fs/cgroup
udev                             236M   12K  236M   1% /dev
tmpfs                             50M  304K   49M   1% /run
none                             5.0M     0  5.0M   0% /run/lock
none                             246M     0  246M   0% /run/shm
none                             100M     0  100M   0% /run/user
/dev/sda1                        228M   30M  186M  14% /boot
/dev/sdb1                        8.0G   35M  8.0G   1% /var/lib/ceph/osd/ceph-0
ceph@ceph1:~/ceph-deploy$ ls /var/lib/ceph/osd/ceph-0
ceph@ceph1:~/ceph-deploy$ ls /var/lib/ceph/osd/ceph-0/
activate.monmap  current  keyring  store_version
active           fsid     magic    upstart
ceph_fsid        journal  ready    whoami
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy osd create ceph1
ceph-deploy: Must supply disk/path argument: ceph1
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy mds create ceph1
ceph@ceph1:~/ceph-deploy$ ceph
ceph                      ceph_filestore_dump
ceph-authtool             cephfs
ceph-clsinfo              ceph-fuse
ceph-conf                 ceph-mds
ceph-coverage             ceph-mon
ceph-create-keys          ceph_mon_store_converter
ceph-debugpack            ceph-osd
ceph-dencoder             ceph-rbdnamer
ceph-disk                 ceph-run
ceph-disk-activate        ceph-syn
ceph-disk-prepare
ceph@ceph1:~/ceph-deploy$ ceph osd lspools
2013-05-31 19:36:22.703342 7f74de557780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2013-05-31 19:36:22.704964 7f74de557780 -1 ceph_tool_common_init failed.
ceph@ceph1:~/ceph-deploy$ ls /etc/ceph/
ceph.client.admin.keyring  ceph.conf
ceph@ceph1:~/ceph-deploy$ nano /etc/ceph/ceph.conf
ceph@ceph1:~/ceph-deploy$ ls -lh
total 88K
-rwxrwxr-x 1 ceph ceph 1.3K May 31 18:45 bootstrap
-rw-rw-r-- 1 ceph ceph   72 May 31 19:28 ceph.bootstrap-mds.keyring
-rw-rw-r-- 1 ceph ceph   72 May 31 19:28 ceph.bootstrap-osd.keyring
-rw-rw-r-- 1 ceph ceph   64 May 31 19:28 ceph.client.admin.keyring
-rw-rw-r-- 1 ceph ceph  215 May 31 19:16 ceph.conf
drwxrwxr-x 3 ceph ceph 4.0K May 31 19:15 ceph_deploy
lrwxrwxrwx 1 ceph ceph   26 May 31 18:47 ceph-deploy -> virtualenv/bin/ceph-deploy
drwxrwxr-x 2 ceph ceph 4.0K May 31 18:46 ceph_deploy.egg-info
-rw-rw-r-- 1 ceph ceph 2.3K May 31 18:45 ceph-deploy.spec
-rw-rw-r-- 1 ceph ceph 3.3K May 31 19:34 ceph.log
-rw-rw-r-- 1 ceph ceph   73 May 31 19:15 ceph.mon.keyring
drwxrwxr-x 3 ceph ceph 4.0K May 31 18:45 debian
-rw-rw-r-- 1 ceph ceph 1.1K May 31 18:45 LICENSE
-rw-rw-r-- 1 ceph ceph   37 May 31 18:45 MANIFEST.in
-rw-rw-r-- 1 ceph ceph 6.3K May 31 18:45 README.rst
-rw-rw-r-- 1 ceph ceph   38 May 31 18:45 requirements-dev.txt
-rw-rw-r-- 1 ceph ceph   14 May 31 18:45 requirements.txt
drwxrwxr-x 2 ceph ceph 4.0K May 31 18:45 scripts
-rw-rw-r-- 1 ceph ceph   42 May 31 18:45 setup.cfg
-rw-rw-r-- 1 ceph ceph 1.7K May 31 18:45 setup.py
-rw-rw-r-- 1 ceph ceph   93 May 31 18:45 tox.ini
drwxrwxr-x 5 ceph ceph 4.0K May 31 18:47 virtualenv
ceph@ceph1:~/ceph-deploy$ less ceph.mon.keyring
ceph@ceph1:~/ceph-deploy$ ceph osd lspools
2013-05-31 19:41:47.233959 7f0243629780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
ceph@ceph1:~/ceph-deploy$ sudo ceph osd lspools
0 data,1 metadata,2 rbd,
ceph@ceph1:~$ sudo ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; recovery 21/42 degraded (50.000%)
ceph@ceph1:~$ sudo ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    8180K     8146K     35176        0.42
POOLS:
    NAME         ID     USED     %USED     OBJECTS
    data         0      0        0         0
    metadata     1      9518     0         21
    rbd          2      0        0         0

----------------------------------------------------------------------------------

ceph@ceph1:~/ceph-deploy$ ./ceph-deploy mon create ceph1
ceph-mon: mon.noname-a 192.168.0.141:6789/0 is local, renaming to mon.ceph1
ceph-mon: set fsid to a559acba-18a0-4fb4-b54b-baff2f575b5f
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph1 for mon.ceph1
ceph@ceph1:~/ceph-deploy$ ls /var/lib/ceph/bootstrap-*
/var/lib/ceph/bootstrap-mds:
/var/lib/ceph/bootstrap-osd:
ceph@ceph1:~/ceph-deploy$ sudo ls /var/lib/ceph/bootstrap-*
/var/lib/ceph/bootstrap-mds:
/var/lib/ceph/bootstrap-osd:
ceph@ceph1:~/ceph-deploy$ ls /etc/ceph/
ceph.conf
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy gatherkeys
usage: ceph-deploy gatherkeys [-h] HOST [HOST ...]
ceph-deploy gatherkeys: error: too few arguments
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy gatherkeys ceph1
Unable to find /etc/ceph/ceph.client.admin.keyring on ['ceph1']
Unable to find /var/lib/ceph/bootstrap-osd/ceph.keyring on ['ceph1']
Unable to find /var/lib/ceph/bootstrap-mds/ceph.keyring on ['ceph1']
ceph@ceph1:~/ceph-deploy$ sudo ceph
2013-06-01 14:25:46.258167 7fb6e8a86780 -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication
2013-06-01 14:25:46.258236 7fb6e8a86780 -1 ceph_tool_common_init failed.
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy mon create ceph1 ceph2 ceph3
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy gatherkeys ceph1
Unable to find /etc/ceph/ceph.client.admin.keyring on ['ceph1']
Unable to find /var/lib/ceph/bootstrap-osd/ceph.keyring on ['ceph1']
Unable to find /var/lib/ceph/bootstrap-mds/ceph.keyring on ['ceph1']
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy gatherkeys ceph2
Unable to find /var/lib/ceph/bootstrap-mds/ceph.keyring on ['ceph2']
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy mds create ceph2
Traceback (most recent call last):
  File "./ceph-deploy", line 9, in <module>
    load_entry_point('ceph-deploy==0.1', 'console_scripts', 'ceph-deploy')()
  File "/home/ceph/ceph-deploy/ceph_deploy/cli.py", line 112, in main
    return args.func(args)
  File "/home/ceph/ceph-deploy/ceph_deploy/mds.py", line 195, in mds
    mds_create(args)
  File "/home/ceph/ceph-deploy/ceph_deploy/mds.py", line 140, in mds_create
    key = get_bootstrap_mds_key(cluster=args.cluster)
  File "/home/ceph/ceph-deploy/ceph_deploy/mds.py", line 24, in get_bootstrap_md
s_key
    raise RuntimeError('bootstrap-mds keyring not found; run \\'gatherkeys\\'')
RuntimeError: bootstrap-mds keyring not found; run 'gatherkeys'
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy gatherkeys ceph3
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy mds create ceph2
ceph@ceph1:~/ceph-deploy$ ls
bootstrap                   ceph-deploy.spec      requirements.txt
ceph.bootstrap-mds.keyring  ceph.log              scripts
ceph.bootstrap-osd.keyring  ceph.mon.keyring      setup.cfg
ceph.client.admin.keyring   debian                setup.py
ceph.conf                   LICENSE               tox.ini
ceph_deploy                 MANIFEST.in           virtualenv
ceph-deploy                 README.rst
ceph_deploy.egg-info        requirements-dev.txt
ceph@ceph1:~/ceph-deploy$ ls /etc/ceph/
ceph.client.admin.keyring  ceph.conf
ceph@ceph1:~/ceph-deploy$ sudo ceph health
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds; clock skew detected on mon.ceph2, mon.ceph3

Need NTP, it seems. (Not that it helps much, maybe because the virtual machines are underpowered.)
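
A quick attempt at fixing the clock skew, run on each node (ntp and ntpdate from the Ubuntu repositories; the pool server is just an example):

sudo apt-get install ntp
sudo service ntp stop
sudo ntpdate pool.ntp.org
sudo service ntp start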

ceph@ceph1:~/ceph-deploy$ sudo ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    0        0         0            -nan
POOLS:
    NAME         ID     USED     %USED     OBJECTS
    data         0      0        -nan      0
    metadata     1      0        -nan      0
    rbd          2      0        -nan      0

Let's try to bring the cluster up one node at a time.

ceph@ceph1:~$ sudo ceph health
2013-06-01 16:10:15.124969 7f05da69a700  0 -- :/2254 >> 192.168.0.142:6789/0 pipe(0x1fce4b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-01 16:10:20.795912 7f05d4d1b700  0 -- :/2254 >> 192.168.0.143:6789/0 pipe(0x7f05cc000c00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-01 16:10:21.127895 7f05da69a700  0 -- :/2254 >> 192.168.0.142:6789/0 pipe(0x7f05cc003010 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-01 16:10:27.128198 7f05d4c1a700  0 -- 192.168.0.141:0/2254 >> 192.168.0.142:6789/0 pipe(0x7f05cc004000 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

So I guess this means it couldn't find the two nodes that hadn't been turned on yet, which is reasonable.

ceph@ceph1:~$ sudo ceph health
2013-06-01 16:19:45.828195 7fd7dba67700  0 -- :/3279 >> 192.168.0.143:6789/0 pipe(0x15534b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-01 16:19:51.827857 7fd7dba67700  0 -- 192.168.0.141:0/3279 >> 192.168.0.143:6789/0 pipe(0x7fd7cc001d50 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; mds cluster is degraded; no osds; 1 mons down, quorum 0,1 ceph1,ceph2; clock skew detected on mon.ceph2

Another node up.

ceph@ceph1:~$ sudo ceph health
2013-06-01 16:23:12.884989 7fea49e25700  0 -- :/3305 >> 192.168.0.143:6789/0 pipe(0x14ef4b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
2013-06-01 16:23:18.884977 7fea443a5700  0 -- 192.168.0.141:0/3305 >> 192.168.0.143:6789/0 pipe(0x7fea3c001d50 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; mds cluster is degraded; no osds; clock skew detected on mon.ceph2, mon.ceph3

Third node seems sluggish. Eventually:

ceph@ceph1:~$ sudo ceph health
HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; mds cluster is degraded; no osds; clock skew detected on mon.ceph2, mon.ceph3

Now I'm curious about what keys are distributed and which processes are running.

ceph1:

ceph@ceph1:~$ ls /var/lib/ceph/*
/var/lib/ceph/bootstrap-mds:
ceph.keyring
/var/lib/ceph/bootstrap-osd:
ceph.keyring
/var/lib/ceph/mds:
/var/lib/ceph/mon:
ceph-ceph1
/var/lib/ceph/osd:
/var/lib/ceph/tmp:

Processes relating to ceph:

/usr/bin/ceph-mon --cluster=ceph -i ceph1 -f

ceph2:

ceph@ceph2:~$ ls /var/lib/ceph/*
/var/lib/ceph/bootstrap-mds:
ceph.keyring
/var/lib/ceph/bootstrap-osd:
ceph.keyring
/var/lib/ceph/mds:
ceph-ceph2
/var/lib/ceph/mon:
ceph-ceph2
/var/lib/ceph/osd:
/var/lib/ceph/tmp:

Processes relating to ceph:

/usr/bin/ceph-mon --cluster=ceph -i ceph2 -f
/usr/bin/ceph-mds --cluster=ceph -i ceph2 -f

ceph3:

ceph@ceph3:~$ ls /var/lib/ceph/*
/var/lib/ceph/bootstrap-mds:
ceph.keyring
/var/lib/ceph/bootstrap-osd:
ceph.keyring
/var/lib/ceph/mds:
/var/lib/ceph/mon:
ceph-ceph3
/var/lib/ceph/osd:
/var/lib/ceph/tmp:

Processes relating to ceph:

/usr/bin/ceph-mon --cluster=ceph -i ceph3 -f

Let's start adding OSDs.

ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph1
/dev/sdb :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph2
/dev/sdb :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph3
/dev/sda :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph3
/dev/sda :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph3
/dev/sda :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph2
/dev/sda :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph1
/dev/sda :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ sudo fdisk -l /dev/sdb
Disk /dev/sdb: 8589 MB, 8589934592 bytes
255 heads, 63 sectors/track, 1044 cylinders, total 16777216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/sdb doesn't contain a valid partition table
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk zap ceph1:sdb
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph1
/dev/sdb :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy osd prepare ceph1:sdb

This works well, but now ceph3's mon has gone screwy. It stays in probing mode forever.
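
One way to see what a stuck monitor thinks it is doing is to query it over its admin socket (the socket path below is the default for a cluster named ceph; this is just a diagnostic sketch, not a fix):

sudo ceph --admin-daemon /var/run/ceph/ceph-mon.ceph3.asok mon_status
sudo ceph --admin-daemon /var/run/ceph/ceph-mon.ceph3.asok quorum_status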

[Two hours later] So restarting a monitor that has hung is best done like this:

ceph@ceph1$ sudo ceph mon remove ceph3
ceph@ceph3$ sudo mv /var/lib/ceph/mon/ceph-ceph3 /tmp
ceph@ceph3$ sudo mkdir /var/lib/ceph/mon/ceph-ceph3
ceph@ceph3:~$ sudo ceph mon getmap -o /var/lib/ceph/tmp/map.ceph
ceph@ceph3:~$ sudo ceph auth get mon. -o /var/lib/ceph/tmp/key.ceph
ceph@ceph3:~$ sudo ceph-mon -i ceph3 --mkfs --monmap /var/lib/ceph/tmp/map.ceph --keyring /var/lib/ceph/tmp/key.ceph
ceph-mon: set fsid to a559acba-18a0-4fb4-b54b-baff2f575b5f
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-ceph3 for mon.ceph3
ceph@ceph3:~$ sudo ceph mon add ceph3 192.168.0.143
port defaulted to 6789added mon.ceph3 at 192.168.0.143:6789/0
ceph@ceph3:~$ sudo ceph mon stat
e10: 3 mons at {ceph1=192.168.0.141:6789/0,ceph2=192.168.0.142:6789/0,ceph3=192.168.0.143:6789/0}, election epoch 32, quorum 0,1 ceph1,ceph2
ceph@ceph3:~$ sudo ceph-mon -i ceph3 --public-addr 192.168.0.143:6789 -f -d

Time well spent.

Ceph summary

A lot has happened in the world of Ceph in recent months, it seems. The article in Admin Magazine talked about mkcephfs and /etc/ceph/ceph.conf. Today ceph-deploy is the recommended way to deploy a Ceph cluster, and /etc/ceph/ceph.conf seems best used to override default settings such as the journal size or timeouts, since Ceph keeps its configuration information under /var/lib/ceph. Maybe /etc/ceph/ceph.conf is used for starting the cluster up again after a total shutdown? I'll check.
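
For reference, a minimal /etc/ceph/ceph.conf of the kind ceph-deploy generates might look something like this (the fsid and monitor addresses are the ones from my cluster; the journal size is just an example of an override):

[global]
fsid = a559acba-18a0-4fb4-b54b-baff2f575b5f
mon initial members = ceph1, ceph2, ceph3
mon host = 192.168.0.141,192.168.0.142,192.168.0.143

[osd]
osd journal size = 1024    ; journal size in MB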

There are some bugs that appear here and there, but none that threaten data integrity, which is good. The bugs just manifest themselves as the cluster not working or not accepting new nodes. Bad, but not as bad as data loss. Setting up a cluster with ceph-deploy requires accounts that can be reached via passwordless SSH and that can execute programs as root without a password. The Ceph Quick Start Guide shows how this is done.

A Ceph cluster begins with a single node, the admin node. Its special status is reserved for the initial configuration; once the cluster is up and running it doesn't matter which node initialized the system.

# Create a seed configuration (local)
ceph-deploy new ceph1 ceph2 ceph3
# Next we install the software on the nodes
ceph-deploy install ceph1 ceph2 ceph3

At this stage only the admin node has any record of how the cluster is supposed to be created; the other nodes just have the software needed to participate in the cluster. To get the cluster running we need at least one node to act as a monitor. It is recommended to use three or more monitor nodes, and that's what I did:

# Create three monitor nodes
ceph-deploy mon create ceph1 ceph2 ceph3

After this stage all three nodes should have a ceph-mon process running and local copies of keys used to authenticate one another. Several versions of Ceph seem to have a problem where this step fails! I had to install my Ceph cluster using

ceph-deploy install --dev=wip-ceph-tool ceph1 ceph2 ceph3

to make the Monitor creation step work as intended. All nodes should now have the following files installed, with identical content:

/var/lib/ceph/bootstrap-mds/ceph.keyring:
[client.bootstrap-mds]
        key = AQCq6KlRYIiSGxAAx2tod+nCMr+F7mdBFXTPbA==

/var/lib/ceph/bootstrap-osd/ceph.keyring:
[client.bootstrap-osd]
        key = AQCq6KlR0HJ/FRAAC1SGGE3d/HOXPR1cz///2A==

Each node should have a folder with a name like

/var/lib/ceph/mon/{cluster-name}-{node-name}/

So on my third node (ceph3) in my cluster (ceph) the folder is called

/var/lib/ceph/mon/ceph-ceph3/

It is important that these folders contain a keyring file with the same contents across all nodes:

[mon.]
        key = AQC+46lRAAAAABAAkmBEgm09T8Xna9e0VVumzg==
        caps mon = "allow *"

Now we can start adding OSDs, which actually store the data for the end user. The following command _might_ show you the disks available on a node:

ceph-deploy disk list ceph1

Sometimes it doesn't... I had to find the disk label manually for my third node, which interestingly enough is a clone of the same virtual machine used for ceph1 and ceph2! To clean out a disk /dev/sdb on node ceph1:

ceph-deploy disk zap ceph1:sdb

To initialize and activate it:

ceph-deploy osd prepare ceph1:sdb
ceph-deploy osd activate ceph1:sdb

Repeat this for all the nodes in your cluster that you want to use as data stores.
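
If every node has the same spare disk, a small loop saves some typing (assuming /dev/sdb on each of my three nodes, run from the ceph-deploy directory):

for h in ceph1 ceph2 ceph3; do
    ./ceph-deploy disk zap $h:sdb
    ./ceph-deploy osd prepare $h:sdb
    ./ceph-deploy osd activate $h:sdb
done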

Monitors are the processes that organize the work and help users find their files. OSDs store the data. The final kind of process/entity is the metadata server, the MDS. To create one we just execute the following:

ceph-deploy mds create ceph2

Now our cluster should be up and running. Me? I ran into a little trouble with my MDS. It was degraded, and starting the mds process manually yielded this:

ceph@ceph2:~$ sudo /usr/bin/ceph-mds --cluster=ceph -i ceph2 -f
starting mds.ceph2 at :/0
mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Cont
ext*)' thread 7f5eebe3e700 time 2013-06-01 21:04:42.875932
mds/MDSTable.cc: 150: FAILED assert(0)
 ceph version 0.61-121-g1a08418 (1a08418b655fb476814f028f4c63ca8f63cfbb0c)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6d873b]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72297b]
 3: (MDS::handle_core_message(Message*)+0xae7) [0x50e467]
 4: (MDS::_dispatch(Message*)+0x33) [0x50e563]
 5: (MDS::ms_dispatch(Message*)+0xab) [0x51034b]
 6: (DispatchQueue::entry()+0x3c3) [0x844463]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c7bcd]
 8: (()+0x7f8e) [0x7f5ef01c2f8e]
 9: (clone()+0x6d) [0x7f5eee957e1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to int
erpret this.
2013-06-01 21:04:42.884087 7f5eebe3e700 -1 mds/MDSTable.cc: In function 'void MD
STable::load_2(int, ceph::bufferlist&, Context*)' thread 7f5eebe3e700 time 2013-
06-01 21:04:42.875932
mds/MDSTable.cc: 150: FAILED assert(0)
 ceph version 0.61-121-g1a08418 (1a08418b655fb476814f028f4c63ca8f63cfbb0c)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6d873b]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72297b]
 3: (MDS::handle_core_message(Message*)+0xae7) [0x50e467]
 4: (MDS::_dispatch(Message*)+0x33) [0x50e563]
 5: (MDS::ms_dispatch(Message*)+0xab) [0x51034b]
 6: (DispatchQueue::entry()+0x3c3) [0x844463]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c7bcd]
 8: (()+0x7f8e) [0x7f5ef01c2f8e]
 9: (clone()+0x6d) [0x7f5eee957e1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to int
erpret this.
     0> 2013-06-01 21:04:42.884087 7f5eebe3e700 -1 mds/MDSTable.cc: In function 
'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread 7f5eebe3e700 ti
me 2013-06-01 21:04:42.875932
mds/MDSTable.cc: 150: FAILED assert(0)
 ceph version 0.61-121-g1a08418 (1a08418b655fb476814f028f4c63ca8f63cfbb0c)
 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6d873b]
 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72297b]
 3: (MDS::handle_core_message(Message*)+0xae7) [0x50e467]
 4: (MDS::_dispatch(Message*)+0x33) [0x50e563]
 5: (MDS::ms_dispatch(Message*)+0xab) [0x51034b]
 6: (DispatchQueue::entry()+0x3c3) [0x844463]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c7bcd]
 8: (()+0x7f8e) [0x7f5ef01c2f8e]
 9: (clone()+0x6d) [0x7f5eee957e1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to int
erpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7f5eebe3e700
 ceph version 0.61-121-g1a08418 (1a08418b655fb476814f028f4c63ca8f63cfbb0c)
 1: /usr/bin/ceph-mds() [0x872ea0]
 2: (()+0xfbd0) [0x7f5ef01cabd0]
 3: (gsignal()+0x37) [0x7f5eee895037]
 4: (abort()+0x148) [0x7f5eee898698]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5eef1a1e8d]
 6: (()+0x5ef76) [0x7f5eef19ff76]
 7: (()+0x5efa3) [0x7f5eef19ffa3]
 8: (()+0x5f1de) [0x7f5eef1a01de]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d)
 [0x7d7f7d]
 10: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6d873b]
 11: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72297b]
 12: (MDS::handle_core_message(Message*)+0xae7) [0x50e467]
 13: (MDS::_dispatch(Message*)+0x33) [0x50e563]
 14: (MDS::ms_dispatch(Message*)+0xab) [0x51034b]
 15: (DispatchQueue::entry()+0x3c3) [0x844463]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c7bcd]
 17: (()+0x7f8e) [0x7f5ef01c2f8e]
 18: (clone()+0x6d) [0x7f5eee957e1d]
2013-06-01 21:04:42.892112 7f5eebe3e700 -1 *** Caught signal (Aborted) **
 in thread 7f5eebe3e700
 ceph version 0.61-121-g1a08418 (1a08418b655fb476814f028f4c63ca8f63cfbb0c)
 1: /usr/bin/ceph-mds() [0x872ea0]
 2: (()+0xfbd0) [0x7f5ef01cabd0]
 3: (gsignal()+0x37) [0x7f5eee895037]
 4: (abort()+0x148) [0x7f5eee898698]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5eef1a1e8d]
 6: (()+0x5ef76) [0x7f5eef19ff76]
 7: (()+0x5efa3) [0x7f5eef19ffa3]
 8: (()+0x5f1de) [0x7f5eef1a01de]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d)
 [0x7d7f7d]
 10: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6d873b]
 11: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72297b]
 12: (MDS::handle_core_message(Message*)+0xae7) [0x50e467]
 13: (MDS::_dispatch(Message*)+0x33) [0x50e563]
 14: (MDS::ms_dispatch(Message*)+0xab) [0x51034b]
 15: (DispatchQueue::entry()+0x3c3) [0x844463]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c7bcd]
 17: (()+0x7f8e) [0x7f5ef01c2f8e]
 18: (clone()+0x6d) [0x7f5eee957e1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to int
erpret this.
     0> 2013-06-01 21:04:42.892112 7f5eebe3e700 -1 *** Caught signal (Aborted) *
*
 in thread 7f5eebe3e700
 ceph version 0.61-121-g1a08418 (1a08418b655fb476814f028f4c63ca8f63cfbb0c)
 1: /usr/bin/ceph-mds() [0x872ea0]
 2: (()+0xfbd0) [0x7f5ef01cabd0]
 3: (gsignal()+0x37) [0x7f5eee895037]
 4: (abort()+0x148) [0x7f5eee898698]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5eef1a1e8d]
 6: (()+0x5ef76) [0x7f5eef19ff76]
 7: (()+0x5efa3) [0x7f5eef19ffa3]
 8: (()+0x5f1de) [0x7f5eef1a01de]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x43d)
 [0x7d7f7d]
 10: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6d873b]
 11: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe4b) [0x72297b]
 12: (MDS::handle_core_message(Message*)+0xae7) [0x50e467]
 13: (MDS::_dispatch(Message*)+0x33) [0x50e563]
 14: (MDS::ms_dispatch(Message*)+0xab) [0x51034b]
 15: (DispatchQueue::entry()+0x3c3) [0x844463]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x7c7bcd]
 17: (()+0x7f8e) [0x7f5ef01c2f8e]
 18: (clone()+0x6d) [0x7f5eee957e1d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to int
erpret this.

I figured I could delete the MDS and reinstall it, but no go:

ceph@ceph1$ ceph-deploy mds destroy ceph2
subcommand destroy not implemented

The following sorts things out, but I suspect it's not the sort of thing you want to do when the filesystem actually contains something.

ceph@ceph1:~/ceph-deploy$ sudo ceph mds newfs 1 2 --yes-i-really-mean-it

At least it meant I could do some actual stuff, like copy files to the Ceph filesystem and use ceph -w to see the data transfer being recorded!
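
Copying files of course requires the filesystem to be mounted somewhere; a rough sketch of a kernel-client mount (the mount point and secret file are my choices; mount.ceph ships in the ceph-fs-common package, if I recall correctly, and wants the bare secret rather than the keyring):

# extract the admin key into a plain secret file
sudo ceph-authtool /etc/ceph/ceph.client.admin.keyring -p -n client.admin | sudo tee /etc/ceph/admin.secret
sudo mkdir -p /mnt/cephfs
sudo mount -t ceph 192.168.0.141:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
cp some-big-file /mnt/cephfs/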

ceph@ceph1:~/ceph-deploy$ sudo ceph -w
   health HEALTH_WARN clock skew detected on mon.ceph2
   monmap e10: 3 mons at {ceph1=192.168.0.141:6789/0,ceph2=192.168.0.142:6789/0,
ceph3=192.168.0.143:6789/0}, election epoch 42, quorum 0,1,2 ceph1,ceph2,ceph3
   osdmap e113: 3 osds: 3 up, 3 in
    pgmap v359: 192 pgs: 192 active+clean; 1405 KB data, 123 MB used, 21347 MB /
 21470 MB avail
   mdsmap e849: 1/1/1 up {0=ceph2=up:active}
2013-06-01 21:51:44.543047 mon.0 [INF] pgmap v359: 192 pgs: 192 active+clean; 14
05 KB data, 123 MB used, 21347 MB / 21470 MB avail
2013-06-01 21:53:38.546402 mon.0 [INF] pgmap v360: 192 pgs: 192 active+clean; 14
12 KB data, 123 MB used, 21347 MB / 21470 MB avail; 89B/s wr, 0op/s
2013-06-01 21:53:43.591009 mon.0 [INF] pgmap v361: 192 pgs: 192 active+clean; 14
13 KB data, 123 MB used, 21347 MB / 21470 MB avail; 103B/s wr, 0op/s
2013-06-01 21:53:44.865075 mon.0 [INF] pgmap v362: 192 pgs: 192 active+clean; 14
13 KB data, 123 MB used, 21347 MB / 21470 MB avail; 333B/s wr, 0op/s
2013-06-01 21:53:49.464808 mon.0 [INF] pgmap v363: 192 pgs: 192 active+clean; 55
11 KB data, 131 MB used, 21339 MB / 21470 MB avail; 818KB/s wr, 0op/s
2013-06-01 21:53:51.920173 mon.0 [INF] pgmap v364: 192 pgs: 192 active+clean; 17
799 KB data, 135 MB used, 21335 MB / 21470 MB avail; 2843KB/s wr, 0op/s
2013-06-01 21:53:54.977657 mon.0 [INF] pgmap v365: 192 pgs: 192 active+clean; 25
991 KB data, 167 MB used, 21303 MB / 21470 MB avail; 3500KB/s wr, 0op/s
2013-06-01 21:53:59.511038 mon.0 [INF] pgmap v366: 192 pgs: 192 active+clean; 34
185 KB data, 191 MB used, 21279 MB / 21470 MB avail; 2021KB/s wr, 0op/s
2013-06-01 21:54:01.178837 mon.0 [INF] pgmap v367: 192 pgs: 192 active+clean; 58
761 KB data, 259 MB used, 21210 MB / 21470 MB avail; 5646KB/s wr, 1op/s
2013-06-01 21:54:03.823909 mon.0 [INF] osdmap e114: 3 osds: 3 up, 3 in
2013-06-01 21:54:04.320530 mon.0 [INF] pgmap v368: 192 pgs: 192 active+clean; 58
761 KB data, 259 MB used, 21210 MB / 21470 MB avail; 4915KB/s wr, 1op/s
2013-06-01 21:54:06.328516 mon.0 [INF] pgmap v369: 192 pgs: 192 active+clean; 95
625 KB data, 283 MB used, 21186 MB / 21470 MB avail; 7635KB/s wr, 1op/s
2013-06-01 21:54:08.099946 mon.0 [INF] osdmap e115: 3 osds: 3 up, 3 in
2013-06-01 21:54:09.527511 mon.0 [INF] pgmap v370: 192 pgs: 192 active+clean; 95
625 KB data, 283 MB used, 21186 MB / 21470 MB avail; 8612KB/s wr, 2op/s
2013-06-01 21:54:15.709133 mon.0 [INF] pgmap v371: 192 pgs: 192 active+clean; 11
7 MB data, 352 MB used, 21118 MB / 21470 MB avail; 2451KB/s wr, 0op/s
2013-06-01 21:54:19.147960 mon.0 [INF] pgmap v372: 192 pgs: 192 active+clean; 12
5 MB data, 400 MB used, 21070 MB / 21470 MB avail; 3136KB/s wr, 0op/s
2013-06-01 21:54:21.497682 mon.0 [INF] pgmap v373: 192 pgs: 192 active+clean; 12
5 MB data, 408 MB used, 21062 MB / 21470 MB avail; 1571KB/s wr, 0op/s
2013-06-01 21:54:33.088992 mon.1 [INF] mon.ceph2 calling new monitor election
2013-06-01 21:54:34.032669 mon.0 [INF] mon.ceph1 calling new monitor election
2013-06-01 21:54:39.320624 mon.0 [INF] mon.ceph1@0 won leader election with quor
um 0,1
2013-06-01 21:54:40.793482 mon.0 [INF] pgmap v373: 192 pgs: 192 active+clean; 12
5 MB data, 408 MB used, 21062 MB / 21470 MB avail; 1571KB/s wr, 0op/s
2013-06-01 21:54:40.793811 mon.0 [INF] mdsmap e849: 1/1/1 up {0=ceph2=up:active}
2013-06-01 21:54:40.794204 mon.0 [INF] osdmap e115: 3 osds: 3 up, 3 in
2013-06-01 21:54:40.795148 mon.0 [INF] monmap e10: 3 mons at {ceph1=192.168.0.14
1:6789/0,ceph2=192.168.0.142:6789/0,ceph3=192.168.0.143:6789/0}
2013-06-01 21:54:42.362122 mon.0 [INF] pgmap v374: 192 pgs: 192 active+clean; 17
3 MB data, 468 MB used, 21002 MB / 21470 MB avail; 2106KB/s wr, 0op/s
2013-06-01 21:54:45.010251 mon.0 [INF] osdmap e116: 3 osds: 3 up, 3 in
2013-06-01 21:54:46.618913 mon.0 [INF] pgmap v375: 192 pgs: 192 active+clean; 17
3 MB data, 468 MB used, 21002 MB / 21470 MB avail; 2012KB/s wr, 0op/s
2013-06-01 21:54:46.978899 mon.0 [INF] mon.ceph1 calling new monitor election
2013-06-01 21:54:47.555106 mon.0 [INF] mon.ceph1@0 won leader election with quor
um 0,1,2
2013-06-01 21:54:48.172663 mon.0 [INF] pgmap v375: 192 pgs: 192 active+clean; 17
3 MB data, 468 MB used, 21002 MB / 21470 MB avail; 2012KB/s wr, 0op/s
2013-06-01 21:54:48.172904 mon.0 [INF] mdsmap e849: 1/1/1 up {0=ceph2=up:active}
2013-06-01 21:54:48.173121 mon.0 [INF] osdmap e116: 3 osds: 3 up, 3 in
2013-06-01 21:54:21.155899 osd.2 [WRN] 1 slow requests, 1 included below; oldest
 blocked for > 30.014315 secs
2013-06-01 21:54:21.155982 osd.2 [WRN] slow request 30.014315 seconds old, recei
ved at 2013-06-01 21:53:51.141263: osd_op(client.4881.1:30 10000000002.0000001b 
[write 0~4194304] 2.a46b3cc7 snapc 1=[] e113) currently waiting for subops from 
[0]
2013-06-01 21:54:22.156386 osd.2 [WRN] 4 slow requests, 3 included below; oldest
 blocked for > 31.015038 secs
2013-06-01 21:54:22.156425 osd.2 [WRN] slow request 30.932480 seconds old, recei
ved at 2013-06-01 21:53:51.223821: osd_op(client.4881.1:37 10000000002.00000022 
[write 0~4194304] 2.42dbbeb0 snapc 1=[] e113) currently waiting for subops from 
[0]
2013-06-01 21:54:22.156478 osd.2 [WRN] slow request 30.862795 seconds old, recei
ved at 2013-06-01 21:53:51.293506: osd_op(client.4881.1:40 10000000002.00000025 
[write 0~4194304] 2.2eb5a20f snapc 1=[] e113) currently waiting for subops from 
[0]
2013-06-01 21:54:28.524834 osd.0 [WRN] 5 slow requests, 5 included below; oldest
 blocked for > 36.566703 secs
2013-06-01 21:54:28.524918 osd.0 [WRN] slow request 36.566703 seconds old, recei
ved at 2013-06-01 21:53:51.957861: osd_op(client.4881.1:41 10000000002.00000026 
[write 0~4194304] 2.68bafaf snapc 1=[] e113) currently waiting for subops from [
2]
2013-06-01 21:54:28.524923 osd.0 [WRN] slow request 36.414317 seconds old, recei
ved at 2013-06-01 21:53:52.110247: osd_op(client.4881.1:44 10000000002.00000029 
[write 0~4194304] 2.83d38699 snapc 1=[] e113) currently waiting for subops from 
[2]
2013-06-01 21:54:28.524926 osd.0 [WRN] slow request 35.248124 seconds old, recei
ved at 2013-06-01 21:53:53.276440: osd_op(mds.0.1:46 200.00000001 [write 19729~1
370] 1.6e5f474 e108) v4 currently waiting for subops from [2]
2013-06-01 21:54:28.524930 osd.0 [WRN] slow request 35.247980 seconds old, recei
ved at 2013-06-01 21:53:53.276584: osd_op(mds.0.1:47 200.00000000 [writefull 0~8
4] 1.844f3494 e108) v4 currently waiting for subops from [1]
2013-06-01 21:54:28.524933 osd.0 [WRN] slow request 30.616928 seconds old, recei
ved at 2013-06-01 21:53:57.907636: osd_op(mds.0.1:48 200.00000001 [write 21099~1
370] 1.6e5f474 e108) v4 currently waiting for subops from [2]
2013-06-01 21:54:46.385838 mon.1 [INF] mon.ceph2 calling new monitor election
2013-06-01 21:54:48.173551 mon.0 [INF] monmap e10: 3 mons at {ceph1=192.168.0.14
1:6789/0,ceph2=192.168.0.142:6789/0,ceph3=192.168.0.143:6789/0}
2013-06-01 21:54:22.156486 osd.2 [WRN] slow request 30.791905 seconds old, recei
ved at 2013-06-01 21:53:51.364396: osd_op(client.4881.1:42 10000000002.00000027 
[write 0~4194304] 2.aa518026 snapc 1=[] e113) currently waiting for subops from 
[0]
2013-06-01 21:54:45.534300 osd.0 [WRN] 2 slow requests, 2 included below; oldest
 blocked for > 30.400293 secs
2013-06-01 21:54:45.534315 osd.0 [WRN] slow request 30.400293 seconds old, recei
ved at 2013-06-01 21:54:15.133932: osd_op(client.4881.1:55 10000000002.00000034 
[write 0~4194304] 2.26c3f6a5 snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:54:46.168535 osd.2 [WRN] 2 slow requests, 2 included below; oldest
 blocked for > 30.921781 secs
2013-06-01 21:54:46.168540 osd.2 [WRN] slow request 30.921781 seconds old, recei
ved at 2013-06-01 21:54:15.246689: osd_op(client.4881.1:46 10000000002.0000002b 
[write 0~4194304] 2.6222d912 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:49.413695 mon.0 [INF] pgmap v376: 192 pgs: 192 active+clean; 17
7 MB data, 480 MB used, 20990 MB / 21470 MB avail; 614KB/s wr, 0op/s
2013-06-01 21:54:46.168550 osd.2 [WRN] slow request 30.692047 seconds old, recei
ved at 2013-06-01 21:54:15.476423: osd_op(client.4881.1:51 10000000002.00000030 
[write 0~4194304] 2.556ca992 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:51.418630 mon.0 [INF] pgmap v377: 192 pgs: 192 active+clean; 24
9 MB data, 664 MB used, 20806 MB / 21470 MB avail; 14872KB/s wr, 4op/s
2013-06-01 21:54:52.757007 osd.1 [WRN] 6 slow requests, 6 included below; oldest
 blocked for > 32.229046 secs
2013-06-01 21:54:52.842727 osd.1 [WRN] slow request 32.229046 seconds old, recei
ved at 2013-06-01 21:54:20.527622: osd_op(client.4881.1:79 10000000002.0000004c 
[write 0~4194304] 2.f14dcca0 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:52.842752 osd.1 [WRN] slow request 32.118191 seconds old, recei
ved at 2013-06-01 21:54:20.638477: osd_op(client.4881.1:82 10000000002.0000004f 
[write 0~4194304] 2.4f0e20ab snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:52.842755 osd.1 [WRN] slow request 32.031802 seconds old, recei
ved at 2013-06-01 21:54:20.724866: osd_op(client.4881.1:85 10000000002.00000052 
[write 0~4194304] 2.c70117f6 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:52.842759 osd.1 [WRN] slow request 31.955054 seconds old, recei
ved at 2013-06-01 21:54:20.801614: osd_op(client.4881.1:93 10000000002.0000005a 
[write 0~4194304] 2.f74311e0 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:52.842762 osd.1 [WRN] slow request 31.886446 seconds old, recei
ved at 2013-06-01 21:54:20.870222: osd_op(client.4881.1:97 10000000002.0000005e 
[write 0~4194304] 2.463098da snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:52.993316 osd.2 [WRN] 8 slow requests, 6 included below; oldest
 blocked for > 37.746340 secs
2013-06-01 21:54:52.993322 osd.2 [WRN] slow request 32.570671 seconds old, recei
ved at 2013-06-01 21:54:20.422358: osd_op(client.4881.1:63 10000000002.0000003c 
[write 0~4194304] 2.458f4b57 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:52.993325 osd.2 [WRN] slow request 32.481438 seconds old, recei
ved at 2013-06-01 21:54:20.511591: osd_op(client.4881.1:67 10000000002.00000040 
[write 0~4194304] 2.8a861c56 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:52.993329 osd.2 [WRN] slow request 32.392327 seconds old, recei
ved at 2013-06-01 21:54:20.600702: osd_op(client.4881.1:69 10000000002.00000042 
[write 0~4194304] 2.79c17cc8 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:52.993332 osd.2 [WRN] slow request 32.302898 seconds old, recei
ved at 2013-06-01 21:54:20.690131: osd_op(client.4881.1:70 10000000002.00000043 
[write 0~4194304] 2.be685304 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:52.993336 osd.2 [WRN] slow request 32.170180 seconds old, recei
ved at 2013-06-01 21:54:20.822849: osd_op(client.4881.1:75 10000000002.00000048 
[write 0~4194304] 2.e8bdb4b0 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:53.843394 osd.1 [WRN] 8 slow requests, 5 included below; oldest
 blocked for > 33.118369 secs
2013-06-01 21:54:53.843402 osd.1 [WRN] slow request 32.712563 seconds old, recei
ved at 2013-06-01 21:54:21.130672: osd_op(client.4881.1:103 10000000002.00000064
 [write 0~4194304] 2.f641478d snapc 1=[] e115) currently waiting for subops from
 [0]
2013-06-01 21:54:53.843494 osd.1 [WRN] slow request 32.644940 seconds old, recei
ved at 2013-06-01 21:54:21.198295: osd_op(client.4881.1:107 10000000002.00000068
 [write 0~4194304] 2.3b6fbaff snapc 1=[] e115) currently waiting for subops from
 [0]
2013-06-01 21:54:53.843497 osd.1 [WRN] slow request 32.500799 seconds old, recei
ved at 2013-06-01 21:54:21.342436: osd_op(client.4881.1:115 10000000002.00000070
 [write 0~4194304] 2.b94e49d0 snapc 1=[] e115) currently waiting for subops from
 [2]
2013-06-01 21:54:53.843512 osd.1 [WRN] slow request 32.429365 seconds old, recei
ved at 2013-06-01 21:54:21.413870: osd_op(client.4881.1:116 10000000002.00000071
 [write 0~4194304] 2.5e447d74 snapc 1=[] e115) currently waiting for subops from
 [2]
2013-06-01 21:54:53.843516 osd.1 [WRN] slow request 32.361405 seconds old, recei
ved at 2013-06-01 21:54:21.481830: osd_op(client.4881.1:117 10000000002.00000072
 [write 0~1327472] 2.e96722b6 snapc 1=[] e115) currently waiting for subops from
 [0]
2013-06-01 21:54:53.993694 osd.2 [WRN] 13 slow requests, 6 included below; oldes
t blocked for > 38.746931 secs
2013-06-01 21:54:53.993699 osd.2 [WRN] slow request 33.102882 seconds old, recei
ved at 2013-06-01 21:54:20.890738: osd_op(client.4881.1:77 10000000002.0000004a 
[write 0~4194304] 2.f1640dbc snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:53.993702 osd.2 [WRN] slow request 33.024642 seconds old, recei
ved at 2013-06-01 21:54:20.968978: osd_op(client.4881.1:78 10000000002.0000004b 
[write 0~4194304] 2.ff586c03 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:53.993706 osd.2 [WRN] slow request 32.957674 seconds old, recei
ved at 2013-06-01 21:54:21.035946: osd_op(client.4881.1:80 10000000002.0000004d 
[write 0~4194304] 2.ce5497d snapc 1=[] e115) currently waiting for subops from [
1]
2013-06-01 21:54:53.993710 osd.2 [WRN] slow request 32.890095 seconds old, recei
ved at 2013-06-01 21:54:21.103525: osd_op(client.4881.1:86 10000000002.00000053 
[write 0~4194304] 2.53cfa888 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:53.993714 osd.2 [WRN] slow request 32.794783 seconds old, recei
ved at 2013-06-01 21:54:21.198837: osd_op(client.4881.1:90 10000000002.00000057 
[write 0~4194304] 2.d19ea811 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:54.994029 osd.2 [WRN] 18 slow requests, 6 included below; oldes
t blocked for > 39.747274 secs
2013-06-01 21:54:54.994036 osd.2 [WRN] slow request 33.724636 seconds old, recei
ved at 2013-06-01 21:54:21.269327: osd_op(client.4881.1:91 10000000002.00000058 
[write 0~4194304] 2.b1a1e2ea snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:54.994132 osd.2 [WRN] slow request 33.638532 seconds old, recei
ved at 2013-06-01 21:54:21.355431: osd_op(client.4881.1:95 10000000002.0000005c 
[write 0~4194304] 2.c12faf07 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:54.994136 osd.2 [WRN] slow request 33.562062 seconds old, recei
ved at 2013-06-01 21:54:21.431901: osd_op(client.4881.1:96 10000000002.0000005d 
[write 0~4194304] 2.bf8d51c1 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:54:54.994139 osd.2 [WRN] slow request 33.485229 seconds old, recei
ved at 2013-06-01 21:54:21.508734: osd_op(client.4881.1:98 10000000002.0000005f 
[write 0~4194304] 2.7189ee26 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:54.994144 osd.2 [WRN] slow request 33.421113 seconds old, recei
ved at 2013-06-01 21:54:21.572850: osd_op(client.4881.1:99 10000000002.00000060 
[write 0~4194304] 2.9fc73f29 snapc 1=[] e115) currently waiting for subops from 
[0]
2013-06-01 21:54:54.163824 mon.0 [INF] pgmap v378: 192 pgs: 192 active+clean; 26
1 MB data, 688 MB used, 20782 MB / 21470 MB avail; 17349KB/s wr, 4op/s
2013-06-01 21:54:55.793151 mon.0 [INF] pgmap v379: 192 pgs: 192 active+clean; 27
7 MB data, 688 MB used, 20782 MB / 21470 MB avail; 5716KB/s wr, 1op/s
2013-06-01 21:54:59.399972 mon.0 [INF] pgmap v380: 192 pgs: 192 active+clean; 29
4 MB data, 712 MB used, 20758 MB / 21470 MB avail; 7143KB/s wr, 1op/s
2013-06-01 21:54:45.534321 osd.0 [WRN] slow request 30.331751 seconds old, recei
ved at 2013-06-01 21:54:15.202474: osd_op(client.4881.1:57 10000000002.00000036 
[write 0~4194304] 2.4664e3ef snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:54:55.994491 osd.2 [WRN] 23 slow requests, 6 included below; oldes
t blocked for > 40.747727 secs
2013-06-01 21:54:55.994500 osd.2 [WRN] slow request 34.359698 seconds old, recei
ved at 2013-06-01 21:54:21.634718: osd_op(client.4881.1:102 10000000002.00000063
 [write 0~4194304] 2.3c3a0f57 snapc 1=[] e115) currently waiting for subops from
 [1]
2013-06-01 21:54:55.994503 osd.2 [WRN] slow request 34.298071 seconds old, recei
ved at 2013-06-01 21:54:21.696345: osd_op(client.4881.1:104 10000000002.00000065
 [write 0~4194304 [1@-1]] 2.b3a86c6a snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:55.994507 osd.2 [WRN] slow request 34.239147 seconds old, recei
ved at 2013-06-01 21:54:21.755269: osd_op(client.4881.1:105 10000000002.00000066
 [write 0~4194304 [1@-1]] 2.d5f7b770 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:55.994511 osd.2 [WRN] slow request 34.178909 seconds old, recei
ved at 2013-06-01 21:54:21.815507: osd_op(client.4881.1:106 10000000002.00000067
 [write 0~4194304 [1@-1]] 2.7bcd0343 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:55.994514 osd.2 [WRN] slow request 34.121846 seconds old, recei
ved at 2013-06-01 21:54:21.872570: osd_op(client.4881.1:108 10000000002.00000069
 [write 0~4194304 [1@-1]] 2.1d9eadea snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:56.995341 osd.2 [WRN] 27 slow requests, 5 included below; oldes
t blocked for > 41.748550 secs
2013-06-01 21:54:56.995359 osd.2 [WRN] slow request 35.061117 seconds old, recei
ved at 2013-06-01 21:54:21.934122: osd_op(client.4881.1:110 10000000002.0000006b
 [write 0~4194304 [1@-1]] 2.c38e32d1 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:56.995363 osd.2 [WRN] slow request 35.004176 seconds old, recei
ved at 2013-06-01 21:54:21.991063: osd_op(client.4881.1:111 10000000002.0000006c
 [write 0~4194304 [1@-1]] 2.809a4397 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:56.995369 osd.2 [WRN] slow request 34.942325 seconds old, recei
ved at 2013-06-01 21:54:22.052914: osd_op(client.4881.1:112 10000000002.0000006d
 [write 0~4194304 [1@-1]] 2.c594f014 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:56.995372 osd.2 [WRN] slow request 34.840243 seconds old, recei
ved at 2013-06-01 21:54:22.154996: osd_op(client.4881.1:113 10000000002.0000006e
 [write 0~4194304 [1@-1]] 2.aec3d5c0 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:54:56.995381 osd.2 [WRN] slow request 34.753019 seconds old, recei
ved at 2013-06-01 21:54:22.242220: osd_op(client.4881.1:114 10000000002.0000006f
 [write 0~4194304 [1@-1]] 2.61b14808 snapc 1=[] e115) currently no flag points r
eached
2013-06-01 21:55:01.554172 osd.0 [WRN] 6 slow requests, 6 included below; oldest
 blocked for > 41.410851 secs
2013-06-01 21:55:01.554177 osd.0 [WRN] slow request 41.410851 seconds old, recei
ved at 2013-06-01 21:54:20.143268: osd_op(client.4881.1:59 10000000002.00000038 
[write 0~4194304] 2.6451f29c snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:01.554181 osd.0 [WRN] slow request 41.317519 seconds old, recei
ved at 2013-06-01 21:54:20.236600: osd_op(client.4881.1:64 10000000002.0000003d 
[write 0~4194304] 2.f273d502 snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:55:01.554184 osd.0 [WRN] slow request 41.215199 seconds old, recei
ved at 2013-06-01 21:54:20.338920: osd_op(client.4881.1:65 10000000002.0000003e 
[write 0~4194304] 2.f91581cb snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:01.554211 osd.0 [WRN] slow request 41.165087 seconds old, recei
ved at 2013-06-01 21:54:20.389032: osd_op(client.4881.1:66 10000000002.0000003f 
[write 0~4194304] 2.a13f4ee8 snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:54:58.850674 osd.1 [WRN] 9 slow requests, 1 included below; oldest
 blocked for > 38.125687 secs
2013-06-01 21:54:58.850701 osd.1 [WRN] slow request 30.021546 seconds old, recei
ved at 2013-06-01 21:54:28.829007: osd_sub_op(client.4881.1:67 2.16 8a861c56/100
00000002.00000040/head//2 [] v 115'1 snapset=0=[]:[] snapc=0=[]) v7 currently st
arted
2013-06-01 21:55:01.852115 osd.1 [WRN] 11 slow requests, 2 included below; oldes
t blocked for > 41.127179 secs
2013-06-01 21:55:01.852142 osd.1 [WRN] slow request 30.449793 seconds old, recei
ved at 2013-06-01 21:54:31.402252: osd_sub_op(client.4881.1:70 2.4 be685304/1000
0000002.00000043/head//2 [] v 115'1 snapset=0=[]:[] snapc=0=[]) v7 currently sta
rted
2013-06-01 21:55:01.852146 osd.1 [WRN] slow request 30.355312 seconds old, recei
ved at 2013-06-01 21:54:31.496733: osd_sub_op(client.4881.1:69 2.8 79c17cc8/1000
0000002.00000042/head//2 [] v 115'3 snapset=0=[]:[] snapc=0=[]) v7 currently sta
rted
2013-06-01 21:55:00.784694 mon.0 [INF] pgmap v381: 192 pgs: 192 active+clean; 29
4 MB data, 724 MB used, 20746 MB / 21470 MB avail; 3459KB/s wr, 0op/s
2013-06-01 21:55:01.554216 osd.0 [WRN] slow request 41.097331 seconds old, recei
ved at 2013-06-01 21:54:20.456788: osd_op(client.4881.1:68 10000000002.00000041 
[write 0~4194304] 2.413c920b snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:02.554560 osd.0 [WRN] 7 slow requests, 6 included below; oldest
 blocked for > 42.097696 secs
2013-06-01 21:55:02.554566 osd.0 [WRN] slow request 41.990740 seconds old, recei
ved at 2013-06-01 21:54:20.563744: osd_op(client.4881.1:73 10000000002.00000046 
[write 0~4194304] 2.19965a53 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:02.554569 osd.0 [WRN] slow request 41.908160 seconds old, recei
ved at 2013-06-01 21:54:20.646324: osd_op(client.4881.1:83 10000000002.00000050 
[write 0~4194304] 2.66e8524b snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:02.554579 osd.0 [WRN] slow request 41.822704 seconds old, recei
ved at 2013-06-01 21:54:20.731780: osd_op(client.4881.1:84 10000000002.00000051 
[write 0~4194304] 2.20754a55 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:02.554583 osd.0 [WRN] slow request 41.690332 seconds old, recei
ved at 2013-06-01 21:54:20.864152: osd_op(client.4881.1:87 10000000002.00000054 
[write 0~4194304] 2.721fa5ba snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:55:02.554588 osd.0 [WRN] slow request 41.537871 seconds old, recei
ved at 2013-06-01 21:54:21.016613: osd_op(client.4881.1:88 10000000002.00000055 
[write 0~4194304] 2.b5190a31 snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:55:03.554890 osd.0 [WRN] 12 slow requests, 6 included below; oldes
t blocked for > 43.098017 secs
2013-06-01 21:55:03.554895 osd.0 [WRN] slow request 42.384882 seconds old, recei
ved at 2013-06-01 21:54:21.169923: osd_op(client.4881.1:89 10000000002.00000056 
[write 0~4194304] 2.abea6ed5 snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:03.554899 osd.0 [WRN] slow request 42.121662 seconds old, recei
ved at 2013-06-01 21:54:21.433143: osd_op(client.4881.1:92 10000000002.00000059 
[write 0~4194304] 2.8b72855c snapc 1=[] e115) currently waiting for subops from 
[1]
2013-06-01 21:55:03.554902 osd.0 [WRN] slow request 41.992901 seconds old, recei
ved at 2013-06-01 21:54:21.561904: osd_op(client.4881.1:94 10000000002.0000005b 
[write 0~4194304] 2.2c854145 snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:55:03.554905 osd.0 [WRN] slow request 41.923166 seconds old, recei
ved at 2013-06-01 21:54:21.631639: osd_op(client.4881.1:101 10000000002.00000062
 [write 0~4194304] 2.4fdaf87b snapc 1=[] e115) currently waiting for subops from
 [1]
2013-06-01 21:55:04.626345 mon.0 [INF] pgmap v382: 192 pgs: 192 active+clean; 30
2 MB data, 773 MB used, 20697 MB / 21470 MB avail; 1600KB/s wr, 0op/s
2013-06-01 21:55:03.554909 osd.0 [WRN] slow request 45.275815 seconds old, recei
ved at 2013-06-01 21:54:18.278990: osd_op(mds.0.1:49 200.00000001 [write 22469~1
370] 1.6e5f474 e115) v4 currently waiting for subops from [2]
2013-06-01 21:55:04.555512 osd.0 [WRN] 15 slow requests, 4 included below; oldes
t blocked for > 44.098609 secs
2013-06-01 21:55:04.555518 osd.0 [WRN] slow request 46.276297 seconds old, recei
ved at 2013-06-01 21:54:18.279100: osd_op(mds.0.1:50 200.00000000 [writefull 0~8
4] 1.844f3494 e115) v4 currently waiting for subops from [1]
2013-06-01 21:55:04.555522 osd.0 [WRN] slow request 36.275566 seconds old, recei
ved at 2013-06-01 21:54:28.279831: osd_op(mds.0.1:51 200.00000001 [write 23839~1
370] 1.6e5f474 e115) v4 currently waiting for subops from [2]
2013-06-01 21:55:04.555539 osd.0 [WRN] slow request 33.295762 seconds old, recei
ved at 2013-06-01 21:54:31.259635: osd_sub_op(client.4881.1:103 2.d f641478d/100
00000002.00000064/head//2 [] v 115'2 snapset=0=[]:[] snapc=0=[]) v7 currently st
arted
2013-06-01 21:55:04.555562 osd.0 [WRN] slow request 32.091846 seconds old, recei
ved at 2013-06-01 21:54:32.463551: osd_sub_op(client.4881.1:107 2.3f 3b6fbaff/10
000000002.00000068/head//2 [] v 115'2 snapset=0=[]:[] snapc=0=[]) v7 currently s
tarted
2013-06-01 21:55:06.306996 mon.0 [INF] pgmap v383: 192 pgs: 192 active+clean; 30
2 MB data, 796 MB used, 20673 MB / 21470 MB avail; 1508KB/s wr, 0op/s
2013-06-01 21:55:08.551511 mon.0 [INF] pgmap v384: 192 pgs: 192 active+clean; 34
6 MB data, 796 MB used, 20673 MB / 21470 MB avail; 11537KB/s wr, 3op/s
2013-06-01 21:55:10.485809 mon.0 [INF] pgmap v385: 192 pgs: 192 active+clean; 34
6 MB data, 822 MB used, 20648 MB / 21470 MB avail; 10974KB/s wr, 3op/s
2013-06-01 21:55:12.118394 mon.0 [INF] pgmap v386: 192 pgs: 192 active+clean; 35
0 MB data, 842 MB used, 20628 MB / 21470 MB avail; 942KB/s wr, 0op/s
2013-06-01 21:55:13.646551 mon.0 [INF] pgmap v387: 192 pgs: 192 active+clean; 36
2 MB data, 867 MB used, 20603 MB / 21470 MB avail; 5021KB/s wr, 1op/s
2013-06-01 21:55:15.333035 mon.0 [INF] pgmap v388: 192 pgs: 192 active+clean; 37
0 MB data, 875 MB used, 20595 MB / 21470 MB avail; 6529KB/s wr, 1op/s
2013-06-01 21:55:18.564221 osd.0 [WRN] 6 slow requests, 2 included below; oldest
 blocked for > 57.547540 secs
2013-06-01 21:55:18.564241 osd.0 [WRN] slow request 60.285163 seconds old, recei
ved at 2013-06-01 21:54:18.278990: osd_op(mds.0.1:49 200.00000001 [write 22469~1
370] 1.6e5f474 e115) v4 currently waiting for subops from [2]
2013-06-01 21:55:18.564245 osd.0 [WRN] slow request 60.285053 seconds old, recei
ved at 2013-06-01 21:54:18.279100: osd_op(mds.0.1:50 200.00000000 [writefull 0~8
4] 1.844f3494 e115) v4 currently waiting for subops from [1]
2013-06-01 21:55:21.565182 osd.0 [WRN] 3 slow requests, 1 included below; oldest
 blocked for > 60.003224 secs
2013-06-01 21:55:21.565188 osd.0 [WRN] slow request 60.003224 seconds old, recei
ved at 2013-06-01 21:54:21.561904: osd_op(client.4881.1:94 10000000002.0000005b 
[write 0~4194304] 2.2c854145 snapc 1=[] e115) currently waiting for subops from 
[2]
2013-06-01 21:55:22.011305 osd.2 [WRN] 7 slow requests, 4 included below; oldest
 blocked for > 60.255982 secs
2013-06-01 21:55:22.011310 osd.2 [WRN] slow request 60.255982 seconds old, recei
ved at 2013-06-01 21:54:21.755269: osd_op(client.4881.1:105 10000000002.00000066
 [write 0~4194304] 2.d5f7b770 snapc 1=[] e115) currently waiting for subops from
 [0]
2013-06-01 21:55:22.011332 osd.2 [WRN] slow request 60.138681 seconds old, recei
ved at 2013-06-01 21:54:21.872570: osd_op(client.4881.1:108 10000000002.00000069
 [write 0~4194304] 2.1d9eadea snapc 1=[] e115) currently waiting for subops from
 [1]
2013-06-01 21:55:22.011336 osd.2 [WRN] slow request 60.077129 seconds old, recei
ved at 2013-06-01 21:54:21.934122: osd_op(client.4881.1:110 10000000002.0000006b
 [write 0~4194304] 2.c38e32d1 snapc 1=[] e115) currently waiting for subops from
 [0]
2013-06-01 21:55:22.011339 osd.2 [WRN] slow request 60.020188 seconds old, recei
ved at 2013-06-01 21:54:21.991063: osd_op(client.4881.1:111 10000000002.0000006c
 [write 0~4194304] 2.809a4397 snapc 1=[] e115) currently waiting for subops from
 [1]
2013-06-01 21:55:30.863074 mon.0 [INF] pgmap v389: 192 pgs: 192 active+clean; 41
4 MB data, 895 MB used, 20575 MB / 21470 MB avail; 5510KB/s wr, 1op/s
2013-06-01 21:55:23.011645 osd.2 [WRN] 7 slow requests, 3 included below; oldest
 blocked for > 61.256314 secs
2013-06-01 21:55:23.011651 osd.2 [WRN] slow request 60.958669 seconds old, recei
ved at 2013-06-01 21:54:22.052914: osd_op(client.4881.1:112 10000000002.0000006d
 [write 0~4194304] 2.c594f014 snapc 1=[] e115) currently waiting for subops from
 [0]
2013-06-01 21:55:23.011702 osd.2 [WRN] slow request 60.856587 seconds old, recei
ved at 2013-06-01 21:54:22.154996: osd_op(client.4881.1:113 10000000002.0000006e
 [write 0~4194304] 2.aec3d5c0 snapc 1=[] e115) currently waiting for subops from
 [1]
2013-06-01 21:55:23.011708 osd.2 [WRN] slow request 60.769363 seconds old, recei
ved at 2013-06-01 21:54:22.242220: osd_op(client.4881.1:114 10000000002.0000006f
 [write 0~4194304] 2.61b14808 snapc 1=[] e115) currently waiting for subops from
 [1]
2013-06-01 21:55:31.895803 mon.0 [INF] pgmap v390: 192 pgs: 192 active+clean; 45
8 MB data, 1020 MB used, 20450 MB / 21470 MB avail; 5182KB/s wr, 1op/s
2013-06-01 21:57:28.061121 mon.0 [INF] pgmap v391: 192 pgs: 192 active+clean; 45
8 MB data, 1036 MB used, 20434 MB / 21470 MB avail; 358KB/s wr, 0op/s
2013-06-01 21:57:30.099923 mon.0 [INF] pgmap v392: 192 pgs: 192 active+clean; 45
8 MB data, 1040 MB used, 20430 MB / 21470 MB avail
2013-06-01 21:57:32.147883 mon.0 [INF] pgmap v393: 192 pgs: 192 active+clean; 45
8 MB data, 1040 MB used, 20430 MB / 21470 MB avail

The printout from the df command seems wrong for a mounted Ceph FS: it reports a maximum of 84M of space with 4MB used. Doing ls -lh of the directory on which the FS is mounted reveals about 450MB of data in place (which is the correct number).

cjp@workstation:~$ ls -lh test/
total 459M
-rw-rw-r-- 1 cjp cjp 601K Jun  1 21:46 ceph1-files.tar.gz
-rw-rw-r-- 1 cjp cjp 790K Jun  1 21:47 ceph2-files.tar.gz
-rw-r--r-- 1 cjp cjp 458M Jun  1 21:54 MST3K_102_TheRobot_vs_TheAztecMummy_xvid.avi
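
To cross-check, the cluster's own accounting can be compared with what the kernel client reports; for example, run on one of the cluster nodes:

sudo ceph -s
sudo ceph df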

Mounting the Ceph file system via different monitors works fine.

Via ceph1:

sudo mount.ceph 192.168.0.141:/ /home/cjp/test -v -o name=admin,secretfile=/home/cjp/ceph.client.admin.keyring 

Via ceph3:

sudo mount.ceph 192.168.0.143:/ /home/cjp/test -v -o name=admin,secretfile=/home/cjp/ceph.client.admin.keyring 
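
The kernel client also accepts a comma-separated list of monitor addresses in a single mount, so the mount does not hinge on whichever monitor happens to be named first; for example:

sudo mount.ceph 192.168.0.141,192.168.0.142,192.168.0.143:/ /home/cjp/test -v -o name=admin,secretfile=/home/cjp/ceph.client.admin.keyring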

To my astonishment there were no problems writing files to the Ceph FS even after I took down the mon process on ceph3, through which I had most recently mounted the filesystem. I would have thought I would be required to remount the filesystem via ceph1 or ceph2!

ceph@ceph1:~/ceph-deploy$ sudo ceph health
HEALTH_WARN 1 mons down, quorum 0,1 ceph1,ceph2

Yet the new file is in place:

cjp@workstation:~$ ls -lh test/
total 865M
-rw-rw-r-- 1 cjp cjp 601K Jun  1 21:46 ceph1-files.tar.gz
-rw-rw-r-- 1 cjp cjp 790K Jun  1 21:47 ceph2-files.tar.gz
-rw-r--r-- 1 cjp cjp 406M Jun  1 22:06 MST3K_101_The_Crawling_Eye_xvid.avi
-rw-r--r-- 1 cjp cjp 458M Jun  1 21:54 MST3K_102_TheRobot_vs_TheAztecMummy_xvid.avi
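
The likely explanation is that the client learns the full monitor map when it mounts and simply fails over to a surviving monitor, as long as the remaining monitors still hold a quorum. The quorum can be checked with, for instance:

sudo ceph mon stat
sudo ceph quorum_status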

Adding a new OSD

ceph@ceph1:~/ceph-deploy$ ./ceph-deploy install --dev=wip-ceph-tool ceph4
OK
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk list ceph4
/dev/sda :
 /dev/sda1 other, ext2, mounted on /boot
 /dev/sda2 other
 /dev/sda5 other, LVM2_member
/dev/sr0 other, unknown
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk zap ceph4:sdb
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk prepare ceph4:sdb
ceph@ceph1:~/ceph-deploy$ ./ceph-deploy disk activate ceph4:sdb
ceph@ceph1:~/ceph-deploy$ sudo ceph osd list
unknown command list
ceph@ceph1:~/ceph-deploy$ sudo ceph osd ls
0
1
2
3
ceph@ceph1:~/ceph-deploy$ sudo ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    21470K     19617K     1853K        8.63
POOLS:
    NAME         ID     USED      %USED     OBJECTS
    data         0      0         0         0
    metadata     1      54753     0         21
    rbd          2      864M      4.03      219
ceph@ceph1:~/ceph-deploy$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      0.03998 root default
-2      0.009995                host ceph1
0       0.009995                        osd.0   up      1
-3      0.009995                host ceph2
1       0.009995                        osd.1   up      1
-4      0.009995                host ceph3
2       0.009995                        osd.2   up      1
-5      0.009995                host ceph4
3       0.009995                        osd.3   down    0
ceph@ceph4:~$ sudo /usr/bin/ceph-osd --cluster=ceph -i 3 -f
ceph@ceph1:~/ceph-deploy$ sudo ceph osd tree
2013-06-01 22:58:52.156325 7f83cc4b3700  0 -- :/2858 >> 192.168.0.143:6789/0 pipe(0x2ee74b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
# id    weight  type name       up/down reweight
-1      0.03998 root default
-2      0.009995                host ceph1
0       0.009995                        osd.0   up      1
-3      0.009995                host ceph2
1       0.009995                        osd.1   up      1
-4      0.009995                host ceph3
2       0.009995                        osd.2   up      1
-5      0.009995                host ceph4
3       0.009995                        osd.3   up      1
ceph@ceph1:~/ceph-deploy$ sudo ceph df
2013-06-01 23:01:03.268869 7f156fa5e700  0 -- :/2886 >> 192.168.0.143:6789/0 pipe(0x19c04b0 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    28627K     26576K     2051K        7.17
POOLS:
    NAME         ID     USED      %USED     OBJECTS
    data         0      0         0         0
    metadata     1      54753     0         21
    rbd          2      864M      3.02      219
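
For the record, having to start the new OSD daemon by hand on ceph4 (as above) should not normally be necessary. A sketch of the shorter route, assuming a ceph-deploy version that has the osd create subcommand and an Ubuntu host with the packaged upstart jobs:

./ceph-deploy osd create ceph4:sdb
sudo start ceph-osd id=3

The first command combines prepare and activate; the second, run on ceph4, starts osd.3 under the init system instead of in the foreground.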

Et voilà! A look at how the chunks are stored:

ceph@ceph4:~$ du -sh /var/lib/ceph/osd/ceph-3/current/2.4_head/1000000000
10000000002.00000043__head_BE685304__2
10000000003.0000000d__head_E05B0544__2
10000000003.00000030__head_DC06E244__2
10000000003.00000063__head_517ACC04__2
ceph@ceph4:~$ du -sh /var/lib/ceph/osd/ceph-3/current/2.4_head/*
4.0M    /var/lib/ceph/osd/ceph-3/current/2.4_head/10000000002.00000043__head_BE685304__2
4.0M    /var/lib/ceph/osd/ceph-3/current/2.4_head/10000000003.0000000d__head_E05B0544__2
4.0M    /var/lib/ceph/osd/ceph-3/current/2.4_head/10000000003.00000030__head_DC06E244__2
4.0M    /var/lib/ceph/osd/ceph-3/current/2.4_head/10000000003.00000063__head_517ACC04__2
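
Each chunk is named <inode number in hex>.<object index in hex>, and the cluster can be asked where it has placed a given chunk. A sketch, assuming the object belongs to pool 2 (listed as rbd in the ceph df output above):

sudo ceph osd map rbd 10000000002.00000043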

Copying a new media file to the now extended storage system:

ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  566M  6.5G   8% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  566M  6.5G   8% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  570M  6.5G   8% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  582M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  582M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  582M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  594M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  594M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  602M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  602M  6.5G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  614M  6.4G   9% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sdb1                        7.0G  826M  6.2G  12% /var/lib/ceph/osd/ceph-3
ceph@ceph4:~$
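
Rather than re-running df by hand, the growth can be watched continuously, e.g.:

watch -n 5 df -h /var/lib/ceph/osd/ceph-3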

Bringing down ceph3 entirely:

ceph@ceph1:~/ceph-deploy$ sudo ceph health
HEALTH_WARN 79 pgs degraded; 79 pgs stuck unclean; recovery 142/688 degraded (20.640%);  recovering 0 o/s, 15EB/s; 1/4 in osds are down; 1 mons down, quorum 0,1 ceph1,ceph2
ceph@ceph1:~/ceph-deploy$ sudo ceph -w
   health HEALTH_WARN 79 pgs degraded; 79 pgs stuck unclean; recovery 142/688 degraded (20.640%);  recovering 0 o/s, 15EB/s; 1/4 in osds are down; 1 mons down, quorum 0,1 ceph1,ceph2
   monmap e10: 3 mons at {ceph1=192.168.0.141:6789/0,ceph2=192.168.0.142:6789/0,ceph3=192.168.0.143:6789/0}, election epoch 48, quorum 0,1 ceph1,ceph2
   osdmap e124: 4 osds: 3 up, 4 in
    pgmap v562: 192 pgs: 113 active+clean, 79 active+degraded; 1279 MB data, 2722 MB used, 25905 MB / 28627 MB avail; 142/688 degraded (20.640%);  recovering 0 o/s, 15EB/s
   mdsmap e849: 1/1/1 up {0=ceph2=up:active}
2013-06-01 23:19:27.580527 mon.0 [INF] pgmap v562: 192 pgs: 113 active+clean, 79 active+degraded; 1279 MB data, 2722 MB used, 25905 MB / 28627 MB avail; 142/688 degraded (20.640%);  recovering 0 o/s, 15EB/s

Seems ok. No data loss indicated.
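
To see exactly which placement groups are degraded and which OSDs they are waiting on, the health output can be broken down further, for example:

sudo ceph health detail
sudo ceph pg dump_stuck unclean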

Real deployment

Before using Ceph to actually store data I would need to do some testing. First I would need an OS/Ceph combination that I could trust. Ubuntu, CentOS and openSUSE are prime candidates, and the stable release of Ceph ought to be a good bet.

It will not be sufficient that I can - as now - slowly claw my way forward by going through source code, changing scripts and searching mailing list archives for bugs. It will be necessary to have an installation process that does not fail. As some problems with Ceph seem to be random, several installs from scratch should be carried out to make sure it isn't just two out of three deployments that work; something along the lines of the loop sketched below could automate that.
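
A rough sketch of how such repeated from-scratch installs could be driven by ceph-deploy - assuming three hosts named ceph1 to ceph3, a spare /dev/sdb on each, and that purgedata/forgetkeys behave as documented:

for run in 1 2 3 4 5; do
    ./ceph-deploy purgedata ceph1 ceph2 ceph3
    ./ceph-deploy forgetkeys
    ./ceph-deploy new ceph1 ceph2 ceph3
    ./ceph-deploy mon create ceph1 ceph2 ceph3
    ./ceph-deploy gatherkeys ceph1
    ./ceph-deploy osd create ceph1:sdb ceph2:sdb ceph3:sdb
    sleep 300
    sudo ceph health >> install-test.log
done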

The boundaries of what works should also be established. That is to say, how much can I change or intentionally mess up before the system no longer installs or works? It is not OK for the system to rely on me creating all three mons at the outset. It should also be possible to start out with one monitor and then add more; otherwise the system isn't very robust.
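
Growing the monitor set later is at least supposed to be possible. A sketch of the manual procedure as I understand it from the documentation - the first two commands run on an existing node, the rest on the new host, and the host name and address are purely illustrative:

sudo ceph mon getmap -o /tmp/monmap
sudo ceph auth get mon. -o /tmp/mon.keyring
sudo ceph-mon -i ceph4 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
sudo ceph mon add ceph4 192.168.0.144:6789
sudo start ceph-mon id=ceph4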

Adding and restarting monitors and OSDs while the cluster is running should also be tested. I should not have to blank a monitor or OSD configuration to make it rejoin the cluster, as I've had to do recently. The same is true for metadata servers. Shutting the cluster down entirely and then starting it back up again should also be tested; /etc/ceph/ceph.conf may well turn out to become an issue here.
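
On these Ubuntu hosts the daemons are managed by upstart, so restarting an individual daemon should in principle be a matter of something like the following (job and instance names assumed from the Ubuntu packaging) rather than blanking its state:

sudo restart ceph-mon id=ceph1
sudo restart ceph-osd id=0
sudo restart ceph-mds id=ceph2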

It may be that it is only by using something like Chef that this level of reliability can be obtained; ceph-deploy overtly states that it isn't completely implemented.

Finally some dry runs should be made: creating the system, bringing it down and back up again, adding data, adding more nodes and so on - a harsh simulation of actual use. Any bugs or crashes causing data loss or unavailability of data should more or less invalidate the setup. Finding the reason for a bug and establishing a better working configuration for the cluster is interesting, but mainly academic: if a real-world setup exhibits such serious flaws, one must assume the software isn't ready for live use. I would consider myself lucky to lose data in a test, and would not risk actual data loss at that point by proceeding to use the technology live.

Disaster recovery

Is there a tool I could run with all the hard drives disconnected from (hypothetically) failed servers and plugged into a separate rescue computer? Bacula has a separate utility for that, as I recall, and ZFS has a built-in import command for importing drives from an old mirror. Could I make one myself? What would be needed for me to do that? Metadata from the MDS, I assume. I would need to read up a great deal on how Ceph works.
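
One thing the du listing earlier makes clear is that the filestore keeps each 4MB chunk as an ordinary file named <inode>.<offset index>, so for CephFS file data a very crude recovery might not need Ceph at all - just the rescued OSD drives mounted somewhere readable. A hypothetical sketch, assuming GNU find, that every chunk of inode 10000000002 was recovered, and that replicas of the same chunk are only counted once:

for obj in $(find /mnt/osd.*/current -name '10000000002.*__head_*' -printf '%f\n' | sort -u); do
    find /mnt/osd.*/current -name "$obj" | head -n 1 | xargs cat >> recovered.avi
done

Directory structure and file names are another matter: they live in the metadata pool maintained by the MDS, which is where the real reading-up would have to begin.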