RADOS Objects

This article answers a couple of questions I always had about RADOS objects:

  • What data and metadata do they store?
  • How can they be accessed and modified?
  • What are the size limits and common sizes?
  • How long can their names be?
  • What can they do besides storing data?

Introduction

RADOS is Ceph's object store. Clients like RADOS Gateway (RGW), RADOS Block Devices (RBD), and CephFS use it to store data and provide more complex APIs.

Those complex APIs use RADOS objects as building blocks for block device images (RBD), files and directories (CephFS), and Amazon S3/OpenStack Swift-style objects (RGW). The RADOS pools used by a client store not only objects that correspond to client-side data, but also metadata objects, for example for directories, bucket indexes, and parameters. How clients map their data representations to RADOS objects is the topic of another article. This article focuses on the properties of RADOS objects and the rich API RADOS provides around them.

Related Links:

Example Code:

  • Ceph Repository: librados examples - Setup, Teardown, basic RADOS object operations
  • Ceph Repository: rados tool - Command line tool that exposes almost the complete RADOS API. It also shows how to use the libradosstriper API, which automatically stripes data over multiple objects according to a predefined layout.

Names

An object name is a string of arbitrary bytes. It is the main input to CRUSH and therefore determines the OSDs responsible for storing the object. OSDs use object names to name local files or to key entries in their databases.

By default, the length of an object name is limited to 2048 bytes. Compared with local filesystems (see table below), this is huge. The limit is configurable with the osd max object name len setting.

Filesystem   Max filename length (bytes)
ext4         255
XFS          255
ZFS          255
Btrfs        255

Filesystem filename limits (From Wikipedia: Comparison of file systems)

The commit that introduced the setting explains the rationale:

7e0aca1 2014-07-16 Sage Weil <sage@redhat.com>
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)
Previously we had a hard coded limit of 4096.  Objects > 3k crash the OSD
when running on ext4, although they probably work on xfs.  But rgw only
generates objects a bit over 1024 bytes (maybe 1200 tops?), so let set a
more reasonable limit here.  2048 is a nice round number and should be
safe.

Add a test.

Fixes: #8174
Signed-off-by: Sage Weil <sage@redhat.com>

What happens if the name is too long?

The OSDs check object names as part of their transaction / operation processing. If an object name is too long, the client receives an ENAMETOOLONG reply.

Example:

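# xxd -l 1024 -p prints 2048 hex characters: a name exactly at the default limit
./rados --pool=test put $(xxd -l $((2048/2)) -p /dev/urandom | tr -d '\n') \
	<(dd if=/dev/urandom count=10 bs=1M) # works
# 4096 hex characters exceed the default limit
./rados --pool=test put $(xxd -l $((4096/2)) -p /dev/urandom | tr -d '\n') \
	<(dd if=/dev/urandom count=10 bs=1M) # File name too long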

A second value also bounds the maximum length: the limit reported by the object store in use (get_max_object_name_length()). The OSD enforces the smaller of the two:

void ReplicatedPG::do_op(OpRequestRef& op)
{
  [..]

  // object name too long?
  unsigned max_name_len = MIN(g_conf->osd_max_object_name_len,
                              osd->osd->store->get_max_object_name_length());
  if (m->get_oid().name.size() > max_name_len) {
    dout(4) << "do_op '" << m->get_oid().name << "' is longer than "
            << max_name_len << " bytes" << dendl;
    osd->reply_op_error(op, -ENAMETOOLONG);
    return;
  }
 [..]
}
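
The interplay of the two limits can be looked at in isolation. The following is a minimal sketch, not Ceph code: check_object_name and example_name_limits are hypothetical names, and the store limit of 4096 bytes is an assumed value chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Ceph. Returns 0 if the name is accepted
// and -ENAMETOOLONG if it exceeds the smaller of the two limits.
int check_object_name(const std::string& name,
                      std::size_t conf_max_len,     // osd_max_object_name_len
                      std::size_t store_max_len) {  // get_max_object_name_length()
  const std::size_t max_name_len = std::min(conf_max_len, store_max_len);
  return name.size() > max_name_len ? -ENAMETOOLONG : 0;
}

// Mirrors the two rados invocations above. The store limit of 4096 bytes is
// an assumption made for this example.
void example_name_limits() {
  assert(check_object_name(std::string(2048, 'a'), 2048, 4096) == 0);
  assert(check_object_name(std::string(4096, 'a'), 2048, 4096) == -ENAMETOOLONG);
}
```

Because the smaller value wins, raising osd max object name len beyond what the object store supports has no effect in this model.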

ObjectStore Mapping Implementations

How OSDs use object names depends on the object store in use. All of them combine the object name with a namespace, pool ID, and other information, such as the snapshot, to identify objects. This combined information is the object ID. BlueStore and KStore use RocksDB, whereas FileStore uses a filesystem to locate objects by their ID.
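
To make the idea of a combined object ID concrete, here is a small sketch. It is a simplified stand-in, not Ceph's actual type: the struct ObjectId, its field names, and the helper same_object are hypothetical and only illustrate that the name alone does not identify an object.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <tuple>

// Hypothetical, simplified stand-in for Ceph's object ID types. The field
// names are chosen for this example only.
struct ObjectId {
  std::int64_t pool;       // pool ID
  std::string nspace;      // RADOS namespace
  std::string name;        // object name as chosen by the client
  std::uint64_t snapshot;  // snapshot identifier
};

// Two IDs refer to the same object only if every component matches.
bool same_object(const ObjectId& a, const ObjectId& b) {
  return std::tie(a.pool, a.nspace, a.name, a.snapshot) ==
         std::tie(b.pool, b.nspace, b.name, b.snapshot);
}

void example_object_ids() {
  const ObjectId a{1, "", "foo", 0};
  const ObjectId b{1, "backup", "foo", 0};
  assert(same_object(a, a));
  assert(!same_object(a, b));  // same name, different namespace
}
```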

As shown in the table above, common filesystems limit filenames to 255 bytes. FileStore therefore has to go the extra mile to map object names to filenames.

FileStore

FileStore stores RADOS objects in files on a filesystem such as XFS or ext4. It limits object names to 4 KB (4096 bytes). Part of FileStore is a subsystem concerned with long filename support.

In the worst case, mapping an object name involves the following overhead:

  • A SHA1 hash of the object name
  • Multiple iterations in case of filename collisions
  • Xattr lookups to verify the object name to filename mapping

To run into the worst case, a filename has to be longer than the filesystem limit and objects with similar names have to be present.

FileStore also limits the number of files per directory by creating hierarchies of directories on demand.

Long Filenames

First of all, the object name is not the only value encoded in the filename. Object IDs are ghobject_t objects, which also contain information about snapshots, namespaces, and shard IDs. The function LFNIndex::lfn_generate_object_name(const ghobject_t &oid) converts ghobject_t object IDs into strings for filename use.

The code that generates the filename is in LFNIndex::build_filename(). There are two possible paths:

  • The filename is short (shorter than FILENAME_PREFIX_LEN): It is taken as is.
  • The filename is long: It is truncated to FILENAME_PREFIX_LEN and suffixed with:

    • The SHA-1 hash of the full filename
    • i, a candidate index number passed from lfn_get_name
    • The FILENAME_COOKIE

To find an object on the filesystem, a lookup searches for the name generated by LFNIndex::build_filename() and verifies the match against the full name stored in the xattr user.cephos.lfn$INDEX_VERSION (the current index version is 3).

class LFNIndex : public CollectionIndex {
  /// Hash digest output size.
  static const int FILENAME_LFN_DIGEST_SIZE = CEPH_CRYPTO_SHA1_DIGESTSIZE;
  /// Length of filename hash.
  static const int FILENAME_HASH_LEN = FILENAME_LFN_DIGEST_SIZE;
  /// Max filename size.
  static const int FILENAME_MAX_LEN = 4096;
  /// Length of hashed filename.
  static const int FILENAME_SHORT_LEN = 255;
  /// Length of hashed filename prefix.
  static const int FILENAME_PREFIX_LEN;
  /// Length of hashed filename cookie.
  static const int FILENAME_EXTRA = 4;
  /// Lfn cookie value.
  static const string FILENAME_COOKIE;
  /// Name of LFN attribute for storing full name.
  static const string LFN_ATTR;
  /// Prefix for subdir index attributes.
  static const string PHASH_ATTR_PREFIX;
  /// Prefix for index subdirectories.
  static const string SUBDIR_PREFIX;
  [..]
};

Constants that control Ceph's long filename mapping (File: LFNIndex.h)

const int LFNIndex::FILENAME_PREFIX_LEN = FILENAME_SHORT_LEN - FILENAME_HASH_LEN -
                                          FILENAME_COOKIE.size() -
                                          FILENAME_EXTRA;

const string LFNIndex::FILENAME_COOKIE = "long";

Constants that control Ceph's long filename mapping (File: LFNIndex.cc)
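
Plugging in the numbers helps: a SHA-1 digest is 20 bytes, so FILENAME_PREFIX_LEN works out to 255 - 20 - 4 - 4 = 227. The sketch below replays the two paths with these values. It is a simplification, not Ceph's build_filename(): the function name build_filename_sketch is hypothetical, the hash is passed in by the caller instead of being computed, and the "_" separators are an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kShortLen = 255;  // FILENAME_SHORT_LEN
constexpr std::size_t kHashLen = 20;    // FILENAME_HASH_LEN (SHA-1 digest size)
constexpr std::size_t kExtra = 4;       // FILENAME_EXTRA
constexpr char kCookie[] = "long";      // FILENAME_COOKIE
constexpr std::size_t kPrefixLen =
    kShortLen - kHashLen - (sizeof(kCookie) - 1) - kExtra;
static_assert(kPrefixLen == 227, "255 - 20 - 4 - 4");

// Simplified sketch, not Ceph's build_filename(). `hash` stands in for the
// hash of the full name and is supplied by the caller.
std::string build_filename_sketch(const std::string& full_name,
                                  const std::string& hash, int index) {
  if (full_name.size() < kPrefixLen) {
    return full_name;  // short name: taken as is
  }
  // Long name: truncated prefix plus hash, candidate index and cookie.
  // The "_" separators are an assumption of this sketch.
  return full_name.substr(0, kPrefixLen) + "_" + hash + "_" +
         std::to_string(index) + "_" + kCookie;
}

void example_long_filenames() {
  const std::string hash(kHashLen, 'a');  // placeholder, not a real SHA-1
  assert(build_filename_sketch("short", hash, 0) == "short");
  const std::string mapped =
      build_filename_sketch(std::string(300, 'x'), hash, 0);
  assert(mapped.size() == kShortLen);
  assert(mapped.substr(mapped.size() - 7) == "_0_long");
}
```

With a 20-character hash and a single-digit index, the mapped name in this sketch is exactly 255 bytes long, which fits the filesystem limits listed earlier.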

BlueStore and KStore

BlueStore and KStore share the same object mapping code. They keep the object ID / storage location mappings in a memory cache backed by RocksDB. They compute the keys from the object IDs. According to the RocksDB documentation there is no limit on the length of the keys and values.

Operations

RADOS objects are not merely data containers: they also provide synchronization, locking, and access to a key/value database. The data portion supports file-like operations.

Data

The following operations apply to the data part of the object, which can contain arbitrary bytes.

Synchronous

  • create
  • remove
  • write
  • write_full - Write full object. May overwrite if exists.
  • clone_range - Copy part of an object to another
  • append
  • read
  • truncate
  • zero - Overwrite part of the object with zeros
  • stat - Get size and MTime

Async

Most synchronous operations have asynchronous counterparts. Async operations use a callback mechanism to signal completion.

Misc

  • set_alloc_hint - Advise backend on expected object size

TMap

The TMap operations offer key/value-like access to the data part of the object. TMap was deprecated in favor of the Object Map API described below.

  • tmap_put
  • tmap_get
  • tmap_update
  • tmap_to_omap - Convert TMap to Object Map

Structured Data

An Object Map is essentially a key/value database that is attached to the object but is not part of its regular data. Objects may have both Object Maps and regular data.

The difference between Object Maps, TMaps, and Extended Attributes is subtle; see this mailing list thread for more details.

Object Map

  • omap_get_vals
  • omap_get_keys
  • omap_get_header
  • omap_get_vals_by_keys
  • omap_set
  • omap_set_header
  • omap_clear
  • omap_rm_keys

Extended Attributes

Extended Attributes are similar to filesystem xattrs and behave much like them.

  • rmxattr
  • setxattr
  • getxattr
  • getxattrs

Locks

Advisory locks

  • lock_exclusive
  • lock_shared
  • unlock
  • break_lock
  • list_lockers

Object Classes

Object Classes define complex operations that run on the OSDs, but are accessible through librados. Many clients use them for special operations. RGW, for example, provides a method bucket_list to list a bucket's content with one operation. Object Class methods are called using exec.

Watches

Watch and notify mechanism.

  • watch
  • watch_check
  • notify
  • notify_ack

Compound Operations

Send multiple operations at once.

  • operate
  • aio_operate

Size

Ceph clients almost universally use a default object size of 4 MB. The size is usually configured once at initialization; in RBD, for example, when creating a new image.

Most clients use the Ceph Striper API to split their data units (files, images, S3 objects) into RADOS objects of the configured size.
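
As a rough illustration of what such a split means, the sketch below maps a byte offset within a client data unit to an object index and an offset inside that object. It assumes the simplest possible layout, where data fills one object after another without interleaving; the real striper supports richer layouts. The names ObjectExtent and locate are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kObjectSize = 4ull * 1024 * 1024;  // 4 MB default

struct ObjectExtent {
  std::uint64_t object_no;  // index of the RADOS object
  std::uint64_t offset;     // byte offset inside that object
};

// Hypothetical helper: assumes data is laid out object after object,
// without interleaving across objects.
ObjectExtent locate(std::uint64_t byte_offset,
                    std::uint64_t object_size = kObjectSize) {
  assert(object_size > 0);
  return ObjectExtent{byte_offset / object_size, byte_offset % object_size};
}

void example_locate() {
  // Byte 10 MB + 5 falls into the third object (index 2), 2 MB + 5 bytes in.
  const ObjectExtent e = locate(10ull * 1024 * 1024 + 5);
  assert(e.object_no == 2);
  assert(e.offset == 2ull * 1024 * 1024 + 5);
}
```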

Still, 4 MB is not the maximum size. The maximum is configurable with the osd max object size setting and defaults to 100 GB. Writes beyond that limit result in an EFBIG error. The commit that introduced this setting has more details:

f1b6bd7 2013-06-13 David Zafman <david.zafman@inktank.com>
osd: EINVAL from truncate causes osd to crash
Maximum object size is 100GB configurable with osd_max_object_size
Error EFBIG if attempt to WRITE/WRITEFULL/TRUNCATE beyond osd_max_object_size
Error EINVAL if length < 1 for WRITE/WRITEFULL/ZERO
Make ZERO beyond existing size a no-op

Fixes: #5252
Fixes: #5340

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index 8a02dd5..094d124 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -485,6 +485,8 @@ OPTION(osd_recovery_op_priority, OPT_INT, 10)
+OPTION(osd_max_object_size, OPT_U64, 100*1024L*1024L*1024L) // OSD's maximum object size
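
The checks named in the commit message can be modeled in a few lines. This is a sketch of the described behavior for a write, not the OSD code: check_write is a hypothetical name, the order of the checks is an assumption, and so is the choice to accept a write that ends exactly at the limit.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Default of osd_max_object_size as shown in the diff above.
constexpr std::uint64_t kMaxObjectSize = 100ull * 1024 * 1024 * 1024;

// Hypothetical model of the checks for a write of `length` bytes at `offset`.
int check_write(std::uint64_t offset, std::uint64_t length,
                std::uint64_t max_size = kMaxObjectSize) {
  if (length < 1) {
    return -EINVAL;  // empty writes are rejected
  }
  // Written this way to avoid overflow in offset + length.
  if (offset > max_size || length > max_size - offset) {
    return -EFBIG;  // write would extend beyond the maximum object size
  }
  return 0;
}

void example_check_write() {
  assert(check_write(0, 0) == -EINVAL);
  assert(check_write(0, 4ull * 1024 * 1024) == 0);
  assert(check_write(kMaxObjectSize, 1) == -EFBIG);
}
```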