Oct 11 2014 OpenStack Swift Ring made understandable

When people talk about OpenStack Swift, we often hear the word Ring. This is because the Ring is a central piece of how Swift works. But what is this thing everyone's talking about?

The Ring refers to 3 files that are shared among all Swift nodes (storage and proxy nodes):

object.ring.gz
container.ring.gz
account.ring.gz

There is actually one ring per type of data manipulated by Swift: Objects, Containers and Accounts. These files determine on which physical devices (hard disks) each object (and also each container and account) is stored. The number of devices on which an object is stored depends on the number of replicas (copies) specified for the Swift cluster.

How does it concretely work

When receiving an object to store, Swift computes an MD5 hash of the object's full name (including the names of its account and container). Part of this hash is kept and interpreted by Swift as the partition number. The length of the hash segment kept depends on the number of partitions set for the Swift cluster; this number is necessarily a power of 2, so if we keep n bits of the hash, we have 2^n partitions.

The object ring is a map that associates each partition with a specific physical device. This mechanism is repeated for each of the object's replicas, and also for containers and accounts.
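The hash-to-partition step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name partition_for and the part_power parameter (the number n of bits kept) are made up for this example, the kept bits are assumed to be the leading bits of the hash, and a real Swift cluster also mixes a cluster-specific secret into the hashed path, which is left out here.

```python
import hashlib
import struct


def partition_for(account, container, obj, part_power):
    """Map an object's full name to a partition number (illustrative sketch).

    part_power is the number of hash bits kept, so the cluster has
    2 ** part_power partitions. Assumes 0 <= part_power <= 32.
    """
    # Full name of the object, including its account and container.
    path = "/{}/{}/{}".format(account, container, obj)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Read the first 4 bytes of the digest as an unsigned big-endian integer.
    top32 = struct.unpack_from(">I", digest)[0]
    # Keep only the leading part_power bits of those 32 bits.
    return top32 >> (32 - part_power)


if __name__ == "__main__":
    # With part_power = 3 there are 8 partitions, numbered 0 to 7.
    print(partition_for("AUTH_test", "photos", "cat.jpg", 3))
```

Because only the leading bits are kept, the same name always lands in the same partition, and the result is always in the range 0 to 2^n - 1.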

To be more specific, the object ring has 3 components:

What the code refers to as the _replica2part2dev table (whose name is fairly explicit, as we'll see later on)
The table of devices, describing each device
The number of bits of an object's hash to use as the partition number

The _replica2part2dev structure is a 2-dimensional table: for any (replica number, partition number) pair, the table indicates the physical device where the object should be stored.

The devices table contains all the information that a Swift node needs in order to reach a given device; it consists mostly of the IP address of the device's storage node, the TCP port to use, and the physical device name on the storage node.

In the end, the Ring is composed of 2 tables and one integer. If I were to choose a name for such a structure, I would call it the Table. I couldn't find any explanation of why the name Ring was adopted, but my guess is that some previous algorithm may have used some modular computation, which people tend to represent using rings.

Example

Here is a simple example to make everything clear. Let's consider a Swift cluster with 2 storage nodes, with the following IP addresses: 192.168.0.10 and 192.168.0.11. Each storage node has two devices: sdb1 and sdc1.

An example of _replica2part2dev table with 3 replicas, 8 partitions and 4 devices would be:

r
e  |   +-----------------+
p  | 0 | 0 1 2 3 0 1 2 3 |
l  | 1 | 1 2 3 0 1 2 3 0 |
i  | 2 | 2 3 0 1 2 3 0 1 |
c  v   +-----------------+
a        0 1 2 3 4 5 6 7
       ------------------>
           partition

The table has 3 rows, one for each replica, and 8 columns, one for each partition. To find the device storing replica number 1 of partition number 2, we select the row of index 1 and the column of index 2. This leads us to device ID 3.
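The lookup described above amounts to indexing a list of lists. The sketch below copies the example table; the names REPLICA2PART2DEV, device_id and device_ids are chosen for this illustration and are not Swift's own API.

```python
# Example table from this post: one row per replica, one column per partition.
REPLICA2PART2DEV = [
    [0, 1, 2, 3, 0, 1, 2, 3],  # replica 0
    [1, 2, 3, 0, 1, 2, 3, 0],  # replica 1
    [2, 3, 0, 1, 2, 3, 0, 1],  # replica 2
]


def device_id(replica, partition):
    """Return the ID of the device holding one replica of a partition."""
    return REPLICA2PART2DEV[replica][partition]


def device_ids(partition):
    """Return the device IDs holding every replica of a partition."""
    return [row[partition] for row in REPLICA2PART2DEV]


if __name__ == "__main__":
    print(device_id(1, 2))  # 3
    print(device_ids(2))    # [2, 3, 0]
```

Reading a whole column gives the devices for all replicas of a partition; in this example the three replicas of every partition land on three different devices.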

The devices table is very similar to what we can obtain by using the swift-ring-builder with only the builder file as argument:

$ swift-ring-builder mybuilder 
mybuilder, build version 4
8 partitions, 3.000000 replicas, 1 regions, 1 zones, 4 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 0
Devices:    id  region  zone      ip address  port  replication ip  replication port      name weight partitions balance meta
             0       1     1    192.168.0.10  6000    192.168.0.10              6000      sdb1 100.00          6    0.00 
             1       1     1    192.168.0.10  6000    192.168.0.10              6000      sdc1 100.00          6    0.00 
             2       1     1    192.168.0.11  6000    192.168.0.11              6000      sdb1 100.00          6    0.00 
             3       1     1    192.168.0.11  6000    192.168.0.11              6000      sdc1 100.00          6    0.00

The device with ID 3 can be found on server 192.168.0.11, port 6000, device name sdc1.
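Resolving a device ID is then a second table lookup. Here is a minimal sketch that holds the four devices listed above as a list of dictionaries indexed by device ID; the key names and the describe_device helper are assumptions made for this example, not necessarily what Swift uses internally.

```python
# The four devices from the swift-ring-builder output above, indexed by ID.
DEVICES = [
    {"id": 0, "ip": "192.168.0.10", "port": 6000, "device": "sdb1"},
    {"id": 1, "ip": "192.168.0.10", "port": 6000, "device": "sdc1"},
    {"id": 2, "ip": "192.168.0.11", "port": 6000, "device": "sdb1"},
    {"id": 3, "ip": "192.168.0.11", "port": 6000, "device": "sdc1"},
]


def describe_device(dev_id):
    """Return 'ip:port/device' for a device ID (hypothetical helper)."""
    dev = DEVICES[dev_id]
    return "{}:{}/{}".format(dev["ip"], dev["port"], dev["device"])


if __name__ == "__main__":
    print(describe_device(3))  # 192.168.0.11:6000/sdc1
```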

Simple is good

What I like about this mechanism is that the smart part of data placement is handled by the swift-ring-builder, a standalone tool provided with Swift. Once the rings have been built, the Swift processes running on the Swift nodes have a fully deterministic and easily predictable behavior.

The swift-ring-builder manipulates builder files; these are files containing architectural information about the Swift cluster (like the distribution of devices and nodes among regions and zones). These builders are then used to generate the ring files. As with the rings, there is one builder per type of data (objects, containers and accounts).

Thanks to this mechanism, the complexity of smartly storing objects is cleanly split between:

Smartly assigning partitions (and the corresponding objects, containers and accounts) to devices, taking into account the cluster's architecture. This is performed by the swift-ring-builder.
Ensuring that files are stored uncorrupted at the appropriate locations. This is performed by the processes running on the Swift nodes.

For more information about the ring, see Swift's developer documentation about the Ring.
