RADOS block devices (RBD) offer block device-like access on top of RADOS. A common use case is virtual machine images, for which the IO path would look like the following:
- Application writes on guest filesystem
- Filesystem accesses a block device on a volume manager like LVM
- Volume manager uses a virtual block device provided by the hypervisor
- Hypervisor implements the virtual block device with librbd
- Librbd converts block device accesses into RADOS object accesses
- The RADOS client communicates with OSDs to store the objects
- The OSD uses its object store backend to persist objects
- The FileStore (default object store) stores objects on a filesystem and journal
- Filesystem and journal are on disk/SSD
RBD is available as a Linux kernel module, as a FUSE filesystem, and in the form of librbd, which many clients such as QEMU and OpenStack Cinder use.
RBD clients usually use a single pool to store multiple images. Each image is consistently named, with blocks striped across multiple objects. Along with the block data objects there are also well-known metadata objects, used for synchronization, settings, and as a directory for images.
List of RBD images in this pool. Implemented as an object map containing key/value pairs for id to name mappings and vice versa.
Used when cloning. An attached object map maps a parent image to a list of child images.
Used for RBD locking operations.
If present, contains pool-specific settings, for example those concerning RBD mirroring.
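The bidirectional directory described above can be pictured as a flat key/value map that holds two entries per image. The following is a minimal in-memory sketch, not the on-disk implementation: the class name `ImageDirectory`, the key prefixes `name_` and `id_`, and the example image ID are assumptions made for illustration.

```python
class ImageDirectory:
    """Toy in-memory stand-in for the key/value pairs of the image directory."""

    NAME_PREFIX = "name_"  # assumed prefix for name -> id entries
    ID_PREFIX = "id_"      # assumed prefix for id -> name entries

    def __init__(self):
        self.kv = {}

    def add(self, name, image_id):
        """Register an image under both lookup directions."""
        name_key = self.NAME_PREFIX + name
        id_key = self.ID_PREFIX + image_id
        if name_key in self.kv or id_key in self.kv:
            raise KeyError("image name or id already registered")
        self.kv[name_key] = image_id
        self.kv[id_key] = name

    def rename(self, old_name, new_name):
        """Renaming only touches the directory; the image id stays the same."""
        if self.NAME_PREFIX + new_name in self.kv:
            raise KeyError("target name already in use")
        image_id = self.kv.pop(self.NAME_PREFIX + old_name)
        self.kv[self.NAME_PREFIX + new_name] = image_id
        self.kv[self.ID_PREFIX + image_id] = new_name

    def id_of(self, name):
        return self.kv[self.NAME_PREFIX + name]

    def name_of(self, image_id):
        return self.kv[self.ID_PREFIX + image_id]

    def list_names(self):
        """Listing images means scanning the name -> id half of the map."""
        prefix = self.NAME_PREFIX
        return sorted(k[len(prefix):] for k in self.kv if k.startswith(prefix))


directory = ImageDirectory()
directory.add("vm-disk", "10074b0dc51")   # hypothetical image name and id
directory.rename("vm-disk", "vm-disk-old")
print(directory.list_names())
```

Keeping both directions in one map means a lookup by name or by ID is a single key fetch, at the cost of updating two entries whenever an image is added or renamed.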
Images come in two flavors: old style and new style. They differ in their data and metadata object names and in the features they support. Here are the per-image new-style object names:
Contains the ID of the image. The name is a human-readable string set when the image is created.
The image header, in the form of an object map. Contains settings such as the enabled features, the prefix used for data objects, the image's size, and the layout settings for striping the objects.
Optional, and only used when the object-map feature is enabled. It tracks allocation and speeds up cloning.
The data objects
Data objects have the following format:
ID is hex encoded.
rbd_directory stores the mapping from ID to image name.
STRIPE is a zero-padded, hex-encoded number.
By default, RBD fills 4 MB objects sequentially. Changing the default requires RBD version 2 with the stripingv2 feature enabled. Images are sparse; an object only exists when non-zero blocks are in its range.
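To make the layout concrete, the sketch below maps a byte offset in an image to an object number and an offset inside that object. With the default arguments it reproduces the sequential 4 MB fill described above; with a smaller stripe unit and a stripe count above one it follows the usual round-robin striping arithmetic. The function names, the 16-digit width of the hex suffix, and the example prefix are assumptions for illustration, not values taken from this document.

```python
DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB default object size


def locate(offset, object_size=DEFAULT_OBJECT_SIZE, stripe_unit=None, stripe_count=1):
    """Return (object_number, offset_in_object) for a byte offset in the image.

    The default layout is the special case stripe_unit == object_size and
    stripe_count == 1, i.e. objects are filled one after another.
    """
    if stripe_unit is None:
        stripe_unit = object_size
    if offset < 0 or stripe_count < 1 or stripe_unit < 1:
        raise ValueError("offset must be >= 0, stripe unit and count >= 1")
    if object_size % stripe_unit != 0:
        raise ValueError("object size must be a multiple of the stripe unit")

    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit           # index of the stripe unit
    stripe_no = block_no // stripe_count       # which row of stripe units
    stripe_pos = block_no % stripe_count       # which object within the set
    object_set = stripe_no // stripes_per_object
    object_no = object_set * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


def data_object_name(prefix, object_no):
    """Build a data object name; the 16-digit hex width is an assumption."""
    return f"{prefix}.{object_no:016x}"


def objects_touched(offset, length, object_size=DEFAULT_OBJECT_SIZE):
    """Object numbers a write covers under the default sequential layout.

    Because images are sparse, only these objects need to exist afterwards.
    """
    if length <= 0:
        return []
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


prefix = "rbd_data.10074b0dc51"  # hypothetical data object prefix
object_no, inner = locate(5 * 1024 * 1024)
print(data_object_name(prefix, object_no), inner)
```

A 1 MB write at offset 3.5 MB spans the boundary between the first two objects, so `objects_touched` reports both, while every other object of the image can remain absent.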
rbd - Command-line tool that exposes most librbd operations
rbd-fuse - FUSE filesystem that presents images as files
OpenStack Cinder has two RADOS clients: a volume driver and a backup driver. Both use RBD images, albeit with different defaults.
Uses the volume driver, but enables additional RBD features:
RBD_FEATURE_STRIPINGV2 to set the following default striper layout:
QEMU has built-in RBD support.
It supports the following options:
BLOCK_OPT_CLUSTER_SIZE - RBD object size. Defaults to the RADOS default (4 MB).
BLOCK_OPT_SIZE - Virtual disk size.
The stripe layouts are not customizable.
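The two options above relate to each other through simple arithmetic. RBD records the object size as a power-of-two exponent, usually called the order, so 4 MB corresponds to order 22. The sketch below converts an object size to an order and computes how many data objects a virtual disk of a given size can occupy at most. The helper names are hypothetical, and the validation shown is an assumption of this sketch rather than QEMU's actual checks.

```python
DEFAULT_ORDER = 22  # 2**22 bytes = 4 MB


def object_size_to_order(object_size):
    """Convert an object size in bytes to its power-of-two exponent."""
    if object_size <= 0:
        raise ValueError("object size must be positive")
    if object_size & (object_size - 1):
        raise ValueError("object size must be a power of two")
    return object_size.bit_length() - 1


def max_object_count(disk_size, order=DEFAULT_ORDER):
    """Upper bound on data objects for a disk of disk_size bytes.

    Images are sparse, so the number of objects that actually exist
    can be much lower than this bound.
    """
    if disk_size < 0:
        raise ValueError("disk size must not be negative")
    object_size = 1 << order
    return (disk_size + object_size - 1) // object_size


order = object_size_to_order(4 * 1024 * 1024)
print(order, max_object_count(10 * 1024**3, order))
```

For a 10 GB virtual disk with the default object size, the bound works out to 2560 objects.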