RADOS Objects
This article answers a couple of questions I always had about RADOS objects:
- What data and metadata do they store?
- How to access and modify them?
- What are the size limits and common sizes?
- What is the limit of their names?
- What can they do besides storing data?
Introduction
RADOS is Ceph's object store. Clients like RADOS Gateway (RGW), RADOS Block Devices (RBD), and CephFS use it to store data and provide more complex APIs.
Those complex APIs use RADOS objects as building blocks for block device images (RBD), files and directories (CephFS), and Amazon S3/OpenStack Swift-style objects (RGW). The RADOS pools used by a client store not only objects that correspond to the client-side data, but also metadata objects, for example for directories, bucket indexes, and parameters. How clients map their data representations to RADOS objects is the topic of another article. This article focuses on the properties of RADOS objects and the rich API RADOS provides around them.
Related Links:
Example Code:
- Ceph Repository: librados examples - Setup, Teardown, basic RADOS object operations
- Ceph Repository: rados tool - Command line tool covering almost the complete RADOS API. It also shows how to use the libradosstriper API, which automatically stripes data over multiple objects according to a pre-defined layout.
Other interesting places:
- librados implementation
- Object Classes - complex operations executed by the OSDs
Names
An object name is a string of arbitrary bytes. Names are the main input to CRUSH and therefore determine which OSDs are responsible for storing an object. OSDs use them to name local files or to key entries in their databases.
Their length is, by default, limited to 2048 bytes. In comparison with local filesystems (see table below) this is huge. It is configurable with the osd max object name len setting.
| FS    | Max length in bytes |
|-------+---------------------|
| ext4  | 255                 |
| XFS   | 255                 |
| ZFS   | 255                 |
| Btrfs | 255                 |
The commit that introduced the setting explains the rationale in more detail:
7e0aca1 2014-07-16 Sage Weil <sage@redhat.com>
osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)

Previously we had a hard coded limit of 4096. Objects > 3k crash the OSD
when running on ext4, although they probably work on xfs. But rgw only
generates objects a bit over 1024 bytes (maybe 1200 tops?), so let set a
more reasonable limit here. 2048 is a nice round number and should be safe.

Add a test.

Fixes: #8174
Signed-off-by: Sage Weil <sage@redhat.com>
What happens if the name is too long?
The OSDs check object names as part of their transaction/operation
processing. If an object name is too long, the client receives an
ENAMETOOLONG reply.
Example:
./rados --pool=test put $(xxd -l $((2048/2)) -p /dev/urandom | tr -d '\n') \
<(dd if=/dev/random count=10 bs=1M) # works
./rados --pool=test put $(xxd -l $((4096/2)) -p /dev/urandom | tr -d '\n') \
<(dd if=/dev/random count=10 bs=1M) # File name too long
There is a second value determining the maximum length, provided by
the object store in use (get_max_object_name_length()). The OSD
enforces the smaller of the two limits:
void ReplicatedPG::do_op(OpRequestRef& op)
{
  [..]
  // object name too long?
  unsigned max_name_len = MIN(g_conf->osd_max_object_name_len,
                              osd->osd->store->get_max_object_name_length());
  if (m->get_oid().name.size() > max_name_len) {
    dout(4) << "do_op '" << m->get_oid().name << "' is longer than "
            << max_name_len << " bytes" << dendl;
    osd->reply_op_error(op, -ENAMETOOLONG);
    return;
  }
  [..]
}
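To make the check easier to play with, here is a minimal Python sketch of the same logic. It is a model, not Ceph code: the function name check_object_name is made up, 2048 is the default mentioned above, and the store limit of 4096 is an assumption based on the FileStore limit discussed later in this article.

```python
import errno
import os

# Assumed values for illustration: 2048 is the documented default of
# osd_max_object_name_len, 4096 stands in for the object store's own limit.
OSD_MAX_OBJECT_NAME_LEN = 2048
STORE_MAX_OBJECT_NAME_LEN = 4096


def check_object_name(name: bytes,
                      conf_limit: int = OSD_MAX_OBJECT_NAME_LEN,
                      store_limit: int = STORE_MAX_OBJECT_NAME_LEN) -> int:
    """Return 0 if the name is acceptable, else -ENAMETOOLONG."""
    max_name_len = min(conf_limit, store_limit)  # the smaller limit wins
    if len(name) > max_name_len:
        return -errno.ENAMETOOLONG
    return 0


# Like `xxd -l N -p`, hex-encoding N random bytes yields 2*N characters.
ok_name = os.urandom(2048 // 2).hex().encode()
long_name = os.urandom(4096 // 2).hex().encode()
```

With these defaults, ok_name is exactly 2048 bytes long and passes, while long_name is rejected. Raising conf_limit above the store limit does not help, because the store limit then becomes the effective one.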
ObjectStore Mapping Implementations
The way OSDs use object names depends on the object store used. All of them combine the object name with a namespace, pool id and other information, such as the snapshot name, to identify objects. This combined information is the object ID. BlueStore and KStore use RocksDB to locate objects by their ID, whereas FileStore uses a filesystem.
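As a rough illustration of why the name alone is not enough, the following sketch models the object ID as a composite key. The ObjectId class and its field names are hypothetical (the real type is ghobject_t and carries more fields), and a plain dict stands in for the store's index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """Hypothetical stand-in for the identity an OSD derives for an object."""
    pool: int
    namespace: str
    name: str
    snapshot: str = "head"  # assumed label for the live, non-snapshot object


# A dict plays the role of the store's index (RocksDB or a filesystem).
index = {}
index[ObjectId(1, "", "foo")] = "location A"
index[ObjectId(1, "ns1", "foo")] = "location B"
index[ObjectId(2, "", "foo")] = "location C"
```

Three objects share the name foo here, yet they occupy three distinct index entries because the pool or namespace differs.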
As mentioned before, filesystems have a filename limit of 255 bytes. Therefore, FileStore may have to go the extra mile to map object names to filesystem filenames.
FileStore
FileStore stores RADOS objects in files on a filesystem such as XFS or ext4. It limits object names to 4k. Part of the FileStore is a subsystem concerned with long filename support.
The worst case object name involves the following overhead:
- A SHA1 hash of the object name
- Multiple iterations in case of filename collisions
- Xattr lookups to verify the object name to filename mapping
To run into the worst case, a filename has to be longer than the filesystem limit and objects with similar names have to be present.
The FileStore also limits the number of files per directory by creating hierarchies of directories on demand.
Long Filenames
First of all, object names are not the only value encoded in the filename. Object IDs are ghobject_t objects that also contain information about snapshots, namespaces and shard ids.
The function LFNIndex::lfn_generate_object_name(const ghobject_t &oid) converts ghobject_t object IDs into strings for filename use.
The code that generates the filename is in LFNIndex::build_filename(). There are two possible paths:
- The filename is short (smaller than FILENAME_PREFIX_LEN): Taken as is.
- The filename is long: It is truncated to FILENAME_PREFIX_LEN and suffixed with:
  - The SHA-1 hash of the full filename
  - i, a candidate index number passed from lfn_get_name
  - The FILENAME_COOKIE
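The two paths can be sketched in a few lines of Python. This is a simplified model under assumptions, not the LFNIndex code: the value 200 for FILENAME_PREFIX_LEN, the cookie string "long", and the underscore separators are all illustrative choices of mine.

```python
import hashlib

# Illustrative values only; the real constants live in the LFNIndex source.
FILENAME_PREFIX_LEN = 200
FILENAME_COOKIE = "long"


def build_filename(full_name: str, i: int) -> str:
    """Return the on-disk name for full_name; i is the candidate index."""
    if len(full_name) < FILENAME_PREFIX_LEN:
        return full_name  # short name: taken as is
    # Long name: keep a readable prefix, then add hash, index and cookie.
    digest = hashlib.sha1(full_name.encode()).hexdigest()
    prefix = full_name[:FILENAME_PREFIX_LEN]
    return f"{prefix}_{digest}_{i}_{FILENAME_COOKIE}"
```

With these numbers a long name becomes 200 + 1 + 40 + 1 + digits(i) + 1 + 4 characters, which stays below the 255 byte filesystem limit for small values of i. Two long names that share the first 200 characters still get different filenames because their hashes differ.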
To find objects on the filesystem, a lookup searches using the name generated by LFNIndex::build_filename() and verifies the result against the xattr user.cephos.lfn$INDEX_VERSION (the current index version is 3).
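A lookup for a long name could look roughly like the sketch below. It is a toy: a dict maps filenames to their xattrs instead of a real directory, the candidate() helper and its name format are assumptions, and only the xattr key user.cephos.lfn3 follows from the text above.

```python
import hashlib

PREFIX_LEN = 200            # illustrative, not the real FILENAME_PREFIX_LEN
XATTR = "user.cephos.lfn3"  # xattr key for index version 3


def candidate(full_name: str, i: int) -> str:
    """Assumed filename format for candidate index i of a long name."""
    digest = hashlib.sha1(full_name.encode()).hexdigest()
    return f"{full_name[:PREFIX_LEN]}_{digest}_{i}_long"


def lookup(directory: dict, full_name: str):
    """Return the file name holding full_name, or None if absent.

    directory maps file name -> {xattr key: value}, a toy filesystem.
    """
    i = 0
    while True:
        fname = candidate(full_name, i)
        xattrs = directory.get(fname)
        if xattrs is None:
            return None      # first free slot: the object is not stored
        if xattrs.get(XATTR) == full_name:
            return fname     # xattr confirms the mapping
        i += 1               # slot taken by a colliding name, try the next
```

Each extra iteration costs one more file lookup and one more xattr read, which is the worst case overhead listed earlier.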
BlueStore and KStore
BlueStore and KStore share the same object mapping code. They keep the object ID / storage location mappings in a memory cache backed by RocksDB. They compute the keys from the object IDs. According to the RocksDB documentation there is no limit on the length of the keys and values.
Operations
RADOS objects are not merely data containers: they also provide synchronization, locking, and access to a key/value database. The data portion has file-like operations.
Data
The following operations apply to the data part of the object, which can contain arbitrary bytes.
Synchronous
- create
- remove
- write
- write_full - Write full object. May overwrite if exists.
- clone_range - Copy part of an object to another
- append
- read
- truncate
- zero - Overwrite part of the object with zeros
- stat - Get size and MTime
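To get a feeling for the file-like semantics, here is a small in-memory model of the data part. It is not librados: the ToyObject class is invented for this article, truncate is assumed to behave like POSIX ftruncate, and zero is assumed never to grow the object.

```python
import time


class ToyObject:
    """In-memory model of the data part of one object (not librados)."""

    def __init__(self):
        self.data = bytearray()
        self.mtime = time.time()

    def write(self, buf: bytes, offset: int) -> None:
        if offset > len(self.data):                 # gap reads back as zeros
            self.data.extend(bytes(offset - len(self.data)))
        self.data[offset:offset + len(buf)] = buf
        self.mtime = time.time()

    def write_full(self, buf: bytes) -> None:
        self.data = bytearray(buf)                  # replaces old content
        self.mtime = time.time()

    def append(self, buf: bytes) -> None:
        self.write(buf, len(self.data))

    def read(self, length: int, offset: int = 0) -> bytes:
        return bytes(self.data[offset:offset + length])

    def truncate(self, size: int) -> None:
        if size < len(self.data):
            del self.data[size:]
        else:                                       # assumed: grow with zeros
            self.data.extend(bytes(size - len(self.data)))
        self.mtime = time.time()

    def zero(self, offset: int, length: int) -> None:
        end = min(offset + length, len(self.data))  # never grows the object
        if offset < end:
            self.data[offset:end] = bytes(end - offset)
            self.mtime = time.time()

    def stat(self):
        return len(self.data), self.mtime
```

For example, after write_full(b"hello world") a write(b"HELLO", 0) changes only the first five bytes, and a later zero(0, 5) blanks them without changing the size reported by stat.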
Async
Most synchronous operations have asynchronous counterparts. Async operations use a callback mechanism to signal completion.
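The callback pattern can be illustrated without a cluster. The sketch below uses Python's concurrent.futures as a stand-in for the librados completion machinery; the store dict, write_full and on_complete are all hypothetical names, and the real API differs in its details.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

store = {}                       # toy object store: name -> bytes
done = threading.Event()
results = []


def write_full(name: str, data: bytes) -> int:
    store[name] = data
    return 0                     # 0 signals success, like a C return code


def on_complete(future) -> None:
    # Runs once the operation has finished; records the return value.
    results.append(future.result())
    done.set()


with ThreadPoolExecutor(max_workers=1) as pool:
    completion = pool.submit(write_full, "foo", b"payload")
    completion.add_done_callback(on_complete)
    # The caller could do other work here instead of blocking.
    done.wait(timeout=5)
```

The important part is the shape: the call returns a completion handle right away, and the callback fires when the result is available.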
Misc
- set_alloc_hint - Advise backend on expected object size
TMap
The TMap operations offer key/value-like access to the data part of the object. They were deprecated in favor of the Object Map API described below.
- tmap_put
- tmap_get
- tmap_update
- tmap_to_omap - Convert TMap to Object Map
Structured Data
Object Maps are basically a key/value database attached to the object name and kept separate from the regular data. Objects may have both Object Maps and regular data.
The difference between Object Maps, TMaps and Extended Attributes is subtle: See this mailing list thread for more details.
Object Map
- omap_get_vals
- omap_get_keys
- omap_get_header
- omap_get_vals_by_keys
- omap_set
- omap_set_header
- omap_clear
- omap_rm_keys
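The following toy model shows how these calls fit together. It is a sketch under assumptions rather than the librados API: the ToyOmap class is invented, the start_after and max_return parameters are my guess at how paging is exposed, and this model keeps the header on omap_clear, which may not match the real behaviour.

```python
class ToyOmap:
    """Toy model of an Object Map: a header plus sorted key/value pairs."""

    def __init__(self):
        self.header = b""
        self.kv = {}

    def omap_set(self, pairs: dict) -> None:
        self.kv.update(pairs)

    def omap_set_header(self, header: bytes) -> None:
        self.header = header

    def omap_get_header(self) -> bytes:
        return self.header

    def omap_get_keys(self, start_after: str = "", max_return: int = 100):
        # Keys come back in sorted order, which makes paging possible.
        keys = sorted(k for k in self.kv if k > start_after)
        return keys[:max_return]

    def omap_get_vals_by_keys(self, keys) -> dict:
        return {k: self.kv[k] for k in keys if k in self.kv}

    def omap_rm_keys(self, keys) -> None:
        for k in keys:
            self.kv.pop(k, None)

    def omap_clear(self) -> None:
        self.kv.clear()          # this toy leaves the header in place
```

Paging through a large map then means calling omap_get_keys repeatedly, each time passing the last key of the previous batch as start_after.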
Extended Attributes
Extended Attributes are similar to filesystem xattrs and behave much like them.
- rmxattr
- setxattr
- getxattr
- getxattrs
Locks
Advisory locks
- lock_exclusive
- lock_shared
- unlock
- break_lock
- list_lockers
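The interplay of shared and exclusive locks can be modelled in a few lines. This is a toy, not the librados implementation: the ToyLock class is invented, identifying a locker by a (client, cookie) pair is a simplification, and the error codes EBUSY and ENOENT are assumptions about what a caller would see.

```python
import errno


class ToyLock:
    """Toy advisory lock on one object; lockers are (client, cookie) pairs."""

    def __init__(self):
        self.exclusive = False
        self.lockers = set()

    def lock_exclusive(self, client: str, cookie: str) -> int:
        if self.lockers:
            return -errno.EBUSY          # someone already holds the lock
        self.exclusive = True
        self.lockers.add((client, cookie))
        return 0

    def lock_shared(self, client: str, cookie: str) -> int:
        if self.lockers and self.exclusive:
            return -errno.EBUSY          # blocked by an exclusive holder
        self.exclusive = False
        self.lockers.add((client, cookie))
        return 0

    def unlock(self, client: str, cookie: str) -> int:
        if (client, cookie) not in self.lockers:
            return -errno.ENOENT
        self.lockers.remove((client, cookie))
        return 0

    # break_lock removes another client's lock; in this toy it is the same
    # bookkeeping as unlock, just issued by a different caller.
    break_lock = unlock

    def list_lockers(self):
        return sorted(self.lockers)
```

Because the locks are advisory, nothing in this model (or in RADOS) stops a client that ignores them from writing to the object anyway.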
Object Classes
Object Classes define complex operations that run on the
OSDs but are accessible through librados.
Many clients use them for special operations.
RGW, for example, provides a method bucket_list to list a bucket's
content with one operation. Object Class methods can be called using exec.
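The idea behind exec is a dispatch by class and method name to code that runs next to the data. The sketch below mimics that with a plain registry; every name in it (CLASSES, cls_method, the toy class and its count_prefix method) is hypothetical and has nothing to do with the classes shipped with Ceph.

```python
# Registry of (class name, method name) -> function, standing in for the
# object classes loaded by an OSD. All names here are hypothetical.
CLASSES = {}


def cls_method(cls: str, method: str):
    """Decorator that registers a function as an object class method."""
    def register(func):
        CLASSES[(cls, method)] = func
        return func
    return register


@cls_method("toy", "count_prefix")
def count_prefix(obj: dict, indata: bytes) -> bytes:
    # Runs next to the data: only the small result crosses the "network".
    prefix = indata.decode()
    n = sum(1 for key in obj["omap"] if key.startswith(prefix))
    return str(n).encode()


def exec_(obj: dict, cls: str, method: str, indata: bytes) -> bytes:
    """Toy counterpart of exec: dispatch to a registered class method."""
    return CLASSES[(cls, method)](obj, indata)
```

A client would then call exec_(obj, "toy", "count_prefix", b"img/") and receive only the count instead of fetching every key and counting locally.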
Watches
Watch and notify mechanism.
- watch
- watch_check
- notify
- notify_ack
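In essence, watchers register interest in an object and a notify reaches all of them. A minimal model, with the invented ToyWatchable class and callback return values standing in for notify_ack, could look like this:

```python
class ToyWatchable:
    """Toy object that delivers notifications to registered watchers."""

    def __init__(self):
        self.watchers = {}       # handle -> callback
        self.next_handle = 1

    def watch(self, callback) -> int:
        """Register a callback and return a handle identifying the watch."""
        handle = self.next_handle
        self.next_handle += 1
        self.watchers[handle] = callback
        return handle

    def notify(self, payload: bytes) -> dict:
        # Each callback's return value plays the role of its notify_ack.
        return {h: cb(payload) for h, cb in self.watchers.items()}
```

The real mechanism is asynchronous and has to deal with timeouts and disconnected watchers, which this sketch leaves out entirely.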
Compound Operations
Send multiple operations at once.
- operate
- aio_operate
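As far as I understand, the operations of one compound write are applied atomically to a single object. The following sketch models that all-or-nothing behaviour with a dict; the operate function and the small op constructors are invented for illustration and do not mirror the librados signatures.

```python
import copy


def operate(obj: dict, ops) -> bool:
    """Apply every op to obj, or none of them if one fails."""
    staged = copy.deepcopy(obj)
    try:
        for op in ops:
            op(staged)
    except Exception:
        return False             # obj is left untouched
    obj.clear()
    obj.update(staged)
    return True


def set_xattr(key, value):
    return lambda o: o["xattrs"].__setitem__(key, value)


def write_full(data):
    return lambda o: o.__setitem__("data", data)


def assert_exists(key):
    """Guard op: fails the whole compound if the xattr is missing."""
    def op(o):
        if key not in o["xattrs"]:
            raise KeyError(key)
    return op
```

A compound of write_full and set_xattr either updates both data and attribute or neither, which is handy for keeping a version attribute in sync with the data.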
Size
Ceph clients almost universally use a default object size of 4 MB. This is usually configured once on initialization; in RBD, for example, when creating a new image.
Most clients use the Ceph Striper API to split their data units (files, images, S3 objects) into RADOS objects of that size.
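For the simplest possible layout, where objects are filled one after another, the mapping from a byte offset to an object is plain integer arithmetic. The sketch below assumes exactly that layout; the function names and the naming scheme are hypothetical, and real clients support more elaborate striping.

```python
OBJECT_SIZE = 4 * 1024 * 1024    # the 4 MB default mentioned above


def locate(offset: int, object_size: int = OBJECT_SIZE):
    """Map a byte offset in a data unit to (object number, offset in object).

    Assumes the simplest layout: objects are filled one after another.
    """
    return offset // object_size, offset % object_size


def object_name(prefix: str, object_no: int) -> str:
    # Hypothetical naming scheme; real clients each define their own.
    return f"{prefix}.{object_no:016x}"
```

Byte 10 MB + 5 of an image, for example, lands in object number 2 at offset 2 MB + 5.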
Still, 4 MB is not the maximum size. The maximum is configurable with the osd max object size setting and defaults to 100 GB. Writes beyond that limit result in an error. The commit that introduced this feature has more details:
f1b6bd7 2013-06-13 David Zafman <david.zafman@inktank.com>
osd: EINVAL from truncate causes osd to crash

Maximum object size is 100GB configurable with osd_max_object_size
Error EFBIG if attempt to WRITE/WRITEFULL/TRUNCATE beyond osd_max_object_size
Error EINVAL if length < 1 for WRITE/WRITEFULL/ZERO
Make ZERO beyond existing size a no-op

Fixes: #5252
Fixes: #5340
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>

diff --git a/src/common/config_opts.h b/src/common/config_opts.h
index 8a02dd5..094d124 100644
--- a/src/common/config_opts.h
+++ b/src/common/config_opts.h
@@ -485,6 +485,8 @@ OPTION(osd_recovery_op_priority, OPT_INT, 10)
+OPTION(osd_max_object_size, OPT_U64, 100*1024L*1024L*1024L) // OSD's maximum object size
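The rules from the commit message translate into a small validation routine. This is my reading of the message, not the OSD code: the function names are invented, and both the order of the checks and the exact boundary condition are assumptions.

```python
import errno

OSD_MAX_OBJECT_SIZE = 100 * 1024 * 1024 * 1024   # default from the commit


def check_write(offset: int, length: int,
                max_size: int = OSD_MAX_OBJECT_SIZE) -> int:
    """Return 0 if a WRITE may proceed, else a negative errno."""
    if length < 1:
        return -errno.EINVAL     # empty writes are rejected
    if offset + length > max_size:
        return -errno.EFBIG      # write would end beyond the size limit
    return 0


def check_truncate(size: int, max_size: int = OSD_MAX_OBJECT_SIZE) -> int:
    """Return 0 if a TRUNCATE to size may proceed, else -EFBIG."""
    return -errno.EFBIG if size > max_size else 0
```

A regular 4 MB write at offset 0 passes, while a two byte write that starts one byte below the limit is refused with EFBIG.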