RADOS Objects in RBD

RADOS Block Device (RBD) provides block-device access on top of RADOS. A common use case is virtual machine images, for which the I/O path looks like this:

  1. Application writes on guest filesystem
  2. Filesystem accesses a block device on a volume manager like LVM
  3. Volume manager uses a virtual block device provided by the hypervisor
  4. Hypervisor implements the virtual block device with librbd
  5. Librbd converts block device accesses into RADOS object accesses
  6. The RADOS client communicates with OSDs to store the objects
  7. The OSD uses its object store backend to persist objects
  8. The FileStore (default object store) stores objects on a filesystem and journal
  9. Filesystem and journal are on disk/SSD

RBD is available as a Linux kernel module, as a FUSE filesystem, and as the librbd library, which many clients such as QEMU and OpenStack Cinder use.

RBD clients usually use a single pool to store multiple images. Each image follows a consistent naming scheme, with its blocks striped across multiple objects. Alongside the block data objects there are well-known metadata objects, used for synchronization, settings, and as a directory of images.

Metadata objects

rbd_directory

Lists the RBD images in this pool. Implemented as an object map (omap) containing key/value pairs that map image IDs to names and vice versa.
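
The two-way mapping can be illustrated with a small sketch. It assumes that the keys carry the prefixes `name_` and `id_` and that values are length-prefixed strings (4-byte little-endian length followed by the bytes); neither detail is stated above, so treat both as assumptions. The helper names and the sample entries are hypothetical, and the dictionary stands in for key/value pairs already read from the object.

```python
import struct


def encode_ceph_string(text: str) -> bytes:
    """Encode a string as a 4-byte little-endian length plus UTF-8 bytes (assumed format)."""
    raw = text.encode("utf-8")
    return struct.pack("<I", len(raw)) + raw


def decode_ceph_string(raw: bytes) -> str:
    """Decode a string produced by encode_ceph_string."""
    (length,) = struct.unpack_from("<I", raw, 0)
    return raw[4:4 + length].decode("utf-8")


def split_directory(omap: dict) -> tuple:
    """Split raw key/value pairs into name->id and id->name dictionaries."""
    name_to_id, id_to_name = {}, {}
    for key, value in omap.items():
        if key.startswith("name_"):      # assumed prefix for name -> id entries
            name_to_id[key[len("name_"):]] = decode_ceph_string(value)
        elif key.startswith("id_"):      # assumed prefix for id -> name entries
            id_to_name[key[len("id_"):]] = decode_ceph_string(value)
    return name_to_id, id_to_name


# Hypothetical directory content for a pool holding one image.
sample_omap = {
    "name_vm-disk-1": encode_ceph_string("10a2ae8944a"),
    "id_10a2ae8944a": encode_ceph_string("vm-disk-1"),
}
names, ids = split_directory(sample_omap)
```

With the sample data, `names` maps the image name to its ID and `ids` maps the ID back to the name, which is the "vice versa" part of the directory.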

rbd_children

Used when cloning. An attached object map maps a parent image to a list of child images.

rbd_lock

Used for RBD locking operations.

rbd_pool_settings

If present, contains pool-specific settings, for example those concerning RBD mirroring.

Images

Images come in two flavors: old style and new style. The data and metadata object names and the supported features differ between the two. The per-image object names for new-style images are:

rbd_id.<NAME>

Contains the ID of the image. The name is a human-readable string set when the image is created.

rbd_header.<ID>

The image header in the form of an object map. It contains settings such as the enabled features, the prefix used for data objects, the image size, and the layout settings for striping the objects.

rbd_object_map.<ID>

Optional, and only used when the object-map feature is enabled. It tracks allocation and speeds up cloning.

rbd_data.<ID>.<STRIPE>

The data objects

Data Objects

Data objects are named rbd_data.<ID>.<STRIPE>. The ID is hex encoded; rbd_directory stores the mapping from ID to name. STRIPE is a zero-padded, hex-encoded number.

By default RBD fills 4 MB objects sequentially. Changing this layout requires a new-style (format 2) image with the stripingv2 feature enabled. Images are sparse: an object only exists when non-zero blocks fall within its range.
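
The mapping from a byte offset in the image to a data object can be sketched as follows. This is a minimal illustration, not librbd's code: it assumes the usual RADOS striping arithmetic (stripe units distributed round-robin over `stripe_count` objects), treats a stripe unit or count of 0 as "use the default", and assumes the STRIPE suffix is padded to 16 hex digits. The function names and the image ID are hypothetical.

```python
MIB = 1024 * 1024


def locate(offset, object_size=4 * MIB, stripe_unit=0, stripe_count=0):
    """Return (object number, offset inside that object) for an image byte offset."""
    su = stripe_unit or object_size   # 0 -> one stripe unit per object (assumed default)
    sc = stripe_count or 1            # 0 -> objects are filled one after another
    if object_size % su != 0:
        raise ValueError("object size must be a multiple of the stripe unit")
    stripes_per_object = object_size // su
    block_no = offset // su                        # index of the stripe unit
    stripe_no, stripe_pos = divmod(block_no, sc)   # row and column in the object set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * sc + stripe_pos
    object_offset = (stripe_no % stripes_per_object) * su + offset % su
    return object_no, object_offset


def data_object_name(image_id, object_no):
    """Build the data object name; the 16-digit padding is an assumption."""
    return "rbd_data.%s.%016x" % (image_id, object_no)


# Default layout: byte 5 of the second 4 MB chunk lands in object 1.
obj, off = locate(4 * MIB + 5)
name = data_object_name("10a2ae8944a", obj)
```

With the default layout the object number is simply the offset divided by the object size. With, say, a 1 MB stripe unit and a stripe count of 2, consecutive 1 MB units alternate between two objects until both are full, and only then does the next pair of objects start.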

Tools

  • rbd - Command-line tool that exposes most librbd operations
  • rbd-fuse - FUSE filesystem that presents images as files

Clients

OpenStack Cinder

OpenStack Cinder has two RADOS clients: a volume driver and a backup driver. Both use RBD, albeit with different defaults.

Volume

Uses the Python librbd bindings with a default chunk size of 4 MB and RBD_FEATURE_LAYERING enabled by default. Layering allows fast copy-on-write image creation (see the Ceph documentation about RBD layering). Source: version for this article, master branch.

Backup

Uses the volume driver, but enables RBD_FEATURE_STRIPINGV2 in addition to RBD_FEATURE_LAYERING so that it can set the following default striping layout (source: version for this article, master branch):

  • rbd_stripe_unit - 0
  • rbd_stripe_count - 0
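
How a client might turn these settings into image-creation parameters can be sketched as below. The sketch makes two assumptions not stated above: that librbd expresses the object size as an "order" (the base-2 logarithm of the size in bytes), and that the two feature flags have the bit values 1 and 2. The constants are hard-coded so the example runs without Ceph installed; the helper name is hypothetical.

```python
# Assumed bit values of the librbd feature flags.
RBD_FEATURE_LAYERING = 1      # bit 0
RBD_FEATURE_STRIPINGV2 = 2    # bit 1


def order_for_chunk_size(chunk_bytes):
    """Return log2 of the chunk size; only powers of two are accepted."""
    if chunk_bytes <= 0 or chunk_bytes & (chunk_bytes - 1):
        raise ValueError("chunk size must be a power of two")
    return chunk_bytes.bit_length() - 1


# 4 MB chunk size, as used by the volume driver.
order = order_for_chunk_size(4 * 1024 * 1024)

# Feature bitmasks: layering only for volumes, layering plus striping v2 for backups.
volume_features = RBD_FEATURE_LAYERING
backup_features = RBD_FEATURE_LAYERING | RBD_FEATURE_STRIPINGV2

# A stripe unit and count of 0 are taken here to mean "keep the default layout".
backup_layout = {"stripe_unit": 0, "stripe_count": 0}
```

Under these assumptions a 4 MB chunk size corresponds to order 22, and the backup driver's feature mask differs from the volume driver's only in the striping bit.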

QEMU RBD

QEMU has built-in RBD support. Source: block/rbd.c, Git

It supports the following options:

  • BLOCK_OPT_CLUSTER_SIZE - RBD object size. Defaults to RADOS default (4 MB)
  • BLOCK_OPT_SIZE - Virtual disk size

The stripe layouts are not customizable.
