Garage: an S3-like object storage

Store a pile of bytes in your garage.

Context

Data storage is critical: it can lead to data loss if done badly and/or when hardware fails. Filesystems + RAID can help on a single machine, but a machine failure can take the whole storage offline. Moreover, it puts a hard limit on scalability. Often this limit can be pushed back far by buying expensive machines, but here we consider non-specialized, off-the-shelf machines that can be as low-powered and failure-prone as a Raspberry Pi.

Distributed storage can help solve both the availability and the scalability problems on such machines. Many solutions have been proposed; they can be categorized as block storage, file storage and object storage, depending on the abstraction they provide.

Block storage is the lowest-level one: it is like exposing your raw hard drive over the network. It requires very low latencies and a stable network, which are often dedicated. However, it provides disk devices that the operating system can manipulate with the fewest constraints: they can be formatted with any filesystem, which means that even the most exotic features are supported. We can cite iSCSI or Fibre Channel. OpenStack Cinder proxies the previous solutions to provide a uniform API.

File storage provides a higher abstraction: such systems are one filesystem among others, which means they don't necessarily have all the exotic features of every filesystem. Often, they relax some POSIX constraints, while many applications remain compatible without any modification. As an example, we are able to run MariaDB (very slowly) over GlusterFS... We can also mention CephFS (read the RADOS whitepaper), Lustre, LizardFS, MooseFS, etc. OpenStack Manila proxies the previous solutions to provide a uniform API.

Finally, object storage provides the highest-level abstraction. It is testimony that the POSIX filesystem API is not well suited to distributed filesystems. In particular, strong consistency has been dropped in favor of eventual consistency, which is far more convenient and powerful in the presence of high latencies and unreliability. We often read about S3, which pioneered the concept, that it is a filesystem for the WAN. Applications must be adapted to work with the desired object storage service. Today, the S3 HTTP REST API acts as a de facto standard in the industry. However, Amazon S3's source code is not open, and alternatives have been proposed. We identified Minio, Pithos, Swift and Ceph. Minio and Ceph enforce a total order, so they have properties similar to a (relaxed) filesystem. Swift and Pithos are probably the most similar to AWS S3 with their consistent hashing ring.

There have been many attempts in research too. One that comes to mind is LBFS, which was used as a basis for Seafile.

Quentin:

Other ideas:

Remark 1: I really like the Rabin fingerprinting approach. However, deduplication means we need to implement reference counting. How do we implement it? If we use a CRDT counter and issue +1, +1, -1, but the operations are applied in the order +1, -1, +1, the counter reads zero at some point and we lose our chunk. ---> we need to be careful in our implementation if we want to play with this.
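
A toy illustration of that interleaving problem, in plain Rust (nothing Garage-specific; the deltas are just the +1, +1, -1 example above):

    // Apply the same three reference-count deltas in two orders. In the
    // intended order the counter never reaches zero before the end; in the
    // reordered one it does, so a GC that deletes a chunk as soon as its
    // counter reads zero would drop it even though the final count is equal.
    fn main() {
        let intended = [1i32, 1, -1]; // +1, +1, -1 as issued
        let reordered = [1i32, -1, 1]; // same deltas, delivered in another order

        for (name, deltas) in [("intended", intended), ("reordered", reordered)] {
            let mut refcount = 0;
            let mut transiently_zero = false;
            for (i, d) in deltas.iter().enumerate() {
                refcount += *d;
                if refcount == 0 && i + 1 < deltas.len() {
                    transiently_zero = true; // a naive GC would delete the chunk here
                }
            }
            println!("{name}: final = {refcount}, transiently zero = {transiently_zero}");
        }
    }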

Remark 2: The Seafile idea is taken from this article: https://pdos.csail.mit.edu/papers/lbfs:sosp01/lbfs.pdf

Random notes

--> we should not talk about "blocks" here. A block is the abstraction your FS manipulates to interact with your hard drive; "chunk" is probably more appropriate. Block storage is a class of distributed storage where you expose the abstraction of your hard drive over the network, mainly SATA over Ethernet; think of iSCSI, Fibre Channel, and so on.

Questions to resolve

  1. Does Cassandra support putting some tables on an SSD and others on a spinning disk?
  2. Does Cassandra/ScyllaDB have an on-disk table format that does not completely fall apart when you store large blobs? (The SQLite devs wrote a whole article to say that even with their library, which is seriously optimized, they consider that above (I believe) 4 kB it is more efficient to put blobs in separate files.) - https://www.sqlite.org/intern-v-extern-blob.html
  3. What block size? The idea is that we still have WAN links whose bandwidth is not necessarily great, and it would be nice if the time to replicate one block stayed on the order of a second at most: for retries, to better monitor progress, etc. Exoscale uses 16 MB. LX proposes 1 MB. (See the rough numbers right after this list.)
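
Rough numbers for question 3, in Rust, with assumed WAN uplink speeds (10 and 100 Mbit/s), just to see the orders of magnitude:

    // Back-of-the-envelope only: assumed bandwidths, not measurements.
    fn main() {
        let block_sizes_mb = [1.0_f64, 16.0]; // LX's proposal vs. Exoscale's choice
        let uplinks_mbps = [10.0_f64, 100.0]; // assumed WAN uplink speeds

        for &size_mb in &block_sizes_mb {
            for &mbps in &uplinks_mbps {
                // 1 MB = 8 Mbit, so transfer time = size * 8 / bandwidth
                let seconds = size_mb * 8.0 / mbps;
                println!("{size_mb} MB block at {mbps} Mbit/s: about {seconds:.2} s");
            }
        }
    }

At 10 Mbit/s, a 1 MB block stays under a second while a 16 MB block takes more than ten; that is the trade-off behind the question.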

Modules

Metadata tables

Objects:

Having only a hash key on the bucket name will lead to storing all the entries of this table for a given bucket on a single node. At the same time, it is the only way I see to be able to quickly list all of a bucket's entries...
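
For illustration, here is a possible shape for a row of this table, in Rust; the field names are guesses consistent with the workflows below, not an actual schema:

    // Illustrative only, not the real schema.
    struct ObjectRow {
        // Partition (hash) key: every entry of a bucket lands on the same node,
        // which is what makes listing a bucket fast, and also what creates the hotspot.
        bucket: String,
        // Sort keys: object key, then version timestamp and version UUID,
        // so several concurrent versions of the same key can coexist.
        key: String,
        version_timestamp: u64,
        version_uuid: [u8; 16],
        // Per-version data.
        complete: bool,
        size: u64,
        inline_data: Option<Vec<u8>>, // small objects could be stored inline
    }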

Blocks:

A version is defined by the existence of at least one entry in the blocks table for a certain version UUID. We must keep the following invariant: if a version exists in the blocks table, it has to be referenced in the objects table. We explicitly manage concurrent versions of an object: the version timestamp and version UUID columns are index columns, so we may have several concurrent versions of an object. Important: before deleting an older version from the objects table, we must make sure that we have successfully deleted that version's blocks from the blocks table.
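
And a possible shape for a row of the blocks table, again with guessed field names:

    // Illustrative only, not the real schema.
    struct BlockRow {
        // Partition key: the version this block belongs to.
        version_uuid: [u8; 16],
        // Sort key: position of the block inside the object.
        block_index: u64,
        // Content hash, used to fetch the block from the block storage layer.
        block_hash: [u8; 32],
    }

    // Invariant, restated: every version_uuid appearing in the blocks table must
    // also appear in the objects table. Hence: write the objects-table entry
    // before the blocks-table entries, and delete the blocks-table entries
    // before the objects-table entry.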

Thus, the workflow for reading an object is as follows:

  1. Check permissions (LDAP)
  2. Read the entry in the objects table. If the data is inline, we have it; stop here. -> if there are several versions, take the newest one and launch deletion of the old ones in the background
  3. Read first block from cluster. If size <= 1 block, stop here.
  4. Simultaneously with previous step, if size > 1 block: query the Blocks table for the IDs of the next blocks
  5. Read subsequent blocks from cluster
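
The read workflow above, as a simplified Rust-flavored sketch; all types and helpers here are assumed stubs for illustration, not Garage's actual API:

    struct Version {
        timestamp: u64,
        uuid: [u8; 16],
        size: u64,
        inline_data: Option<Vec<u8>>,
    }

    #[derive(Debug)]
    enum Error { Forbidden, NotFound, Io }

    // -- assumed stubs --
    fn check_read_permission(_user: &str, _bucket: &str) -> Result<(), Error> { Ok(()) }
    fn read_object_versions(_bucket: &str, _key: &str) -> Result<Vec<Version>, Error> { Ok(vec![]) }
    fn list_block_hashes(_version_uuid: [u8; 16]) -> Result<Vec<[u8; 32]>, Error> { Ok(vec![]) }
    fn fetch_block(_hash: [u8; 32]) -> Result<Vec<u8>, Error> { Err(Error::Io) }
    fn spawn_cleanup_of_old_versions(_bucket: &str, _key: &str, _keep: [u8; 16]) {}

    fn get_object(user: &str, bucket: &str, key: &str) -> Result<Vec<u8>, Error> {
        check_read_permission(user, bucket)?;                 // 1. LDAP check
        let versions = read_object_versions(bucket, key)?;    // 2. objects table
        let newest = versions
            .into_iter()
            .max_by_key(|v| v.timestamp)
            .ok_or(Error::NotFound)?;
        // old versions are cleaned up in the background, not on the read path
        spawn_cleanup_of_old_versions(bucket, key, newest.uuid);
        if let Some(data) = newest.inline_data {
            return Ok(data);                                  //    inline data: done
        }
        // 3./4. in the real flow the first block is fetched while the remaining
        // block IDs are listed; this sketch simply does everything sequentially
        let mut data = Vec::with_capacity(newest.size as usize);
        for hash in list_block_hashes(newest.uuid)? {
            data.extend_from_slice(&fetch_block(hash)?);      // 5. subsequent blocks
        }
        Ok(data)
    }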

Workflow for PUT:

  1. Check write permission (LDAP)
  2. Select a new version UUID
  3. Write a preliminary entry for the new version in the objects table with complete = false
  4. Send blocks to cluster and write entries in the blocks table
  5. Update the version with complete = true and all of the accurate information (size, etc)
  6. Return success to the user
  7. Launch a background job to check and delete older versions
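
The same kind of sketch for PUT, with assumed stubs:

    #[derive(Debug)]
    enum Error { Forbidden, Io }

    // -- assumed stubs --
    fn check_write_permission(_user: &str, _bucket: &str) -> Result<(), Error> { Ok(()) }
    fn new_version_uuid() -> [u8; 16] { [0; 16] }
    fn write_object_entry(_b: &str, _k: &str, _uuid: [u8; 16], _complete: bool, _size: u64) -> Result<(), Error> { Ok(()) }
    fn store_block(_uuid: [u8; 16], _index: u64, _data: &[u8]) -> Result<(), Error> { Ok(()) }
    fn spawn_cleanup_of_old_versions(_b: &str, _k: &str, _keep: [u8; 16]) {}

    fn put_object(user: &str, bucket: &str, key: &str, blocks: &[Vec<u8>]) -> Result<(), Error> {
        check_write_permission(user, bucket)?;              // 1. LDAP check
        let uuid = new_version_uuid();                      // 2. new version UUID
        write_object_entry(bucket, key, uuid, false, 0)?;   // 3. preliminary entry, complete = false
        let mut size = 0u64;
        for (i, block) in blocks.iter().enumerate() {
            store_block(uuid, i as u64, block)?;            // 4. blocks to cluster + blocks table
            size += block.len() as u64;
        }
        write_object_entry(bucket, key, uuid, true, size)?; // 5. complete = true, accurate size
        spawn_cleanup_of_old_versions(bucket, key, uuid);   // 7. background check of older versions
        Ok(())                                              // 6. success returned to the user
    }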

Workflow for DELETE:

  1. Check write permission (LDAP)
  2. Get current version (or versions) in object table
  3. Delete those versions, NOT IN A BACKGROUND JOB THIS TIME
  4. Return success to the user if we were able to delete the blocks from the blocks table and the entries from the objects table

To delete a version:

  1. List the blocks from Cassandra
  2. For each block, delete it from the cluster. We don't care if some deletions fail; GC will take care of them.
  3. Delete all of the blocks from the blocks table
  4. Finally, delete the version from the objects table
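
A combined sketch of the DELETE workflow and of per-version deletion, again with assumed stubs; the objects-table entry is removed last, which preserves the invariant stated in the blocks section:

    #[derive(Debug)]
    enum Error { Forbidden, Io }

    // -- assumed stubs --
    fn check_write_permission(_user: &str, _bucket: &str) -> Result<(), Error> { Ok(()) }
    fn current_version_uuids(_b: &str, _k: &str) -> Result<Vec<[u8; 16]>, Error> { Ok(vec![]) }
    fn list_block_hashes(_uuid: [u8; 16]) -> Result<Vec<[u8; 32]>, Error> { Ok(vec![]) }
    fn delete_block_from_cluster(_hash: [u8; 32]) -> Result<(), Error> { Ok(()) }
    fn delete_blocks_table_entries(_uuid: [u8; 16]) -> Result<(), Error> { Ok(()) }
    fn delete_object_entry(_b: &str, _k: &str, _uuid: [u8; 16]) -> Result<(), Error> { Ok(()) }

    fn delete_version(bucket: &str, key: &str, uuid: [u8; 16]) -> Result<(), Error> {
        for hash in list_block_hashes(uuid)? {              // 1. list the blocks
            // 2. best-effort deletion from the cluster; failures are left to GC
            let _ = delete_block_from_cluster(hash);
        }
        delete_blocks_table_entries(uuid)?;                 // 3. blocks table
        delete_object_entry(bucket, key, uuid)              // 4. objects table, last
    }

    fn delete_object(user: &str, bucket: &str, key: &str) -> Result<(), Error> {
        check_write_permission(user, bucket)?;              // 1. LDAP check
        for uuid in current_version_uuids(bucket, key)? {   // 2. current version(s)
            delete_version(bucket, key, uuid)?;             // 3. synchronously, not in background
        }
        Ok(())                                              // 4. success only if everything was deleted
    }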

Known issue: if someone is reading from a version that we want to delete and the object is big, the read might be interrupted. I think it is OK to leave it like this; we just cut the connection if data disappears during a read.

("Soit P un problème, on s'en fout est une solution à ce problème")

Block storage on disk

Blocks themselves:

Reverse index for GC & other block-level metadata:

Useful metadata:

Write strategy: have a single thread that does all write I/O so that it is serialized (or have several threads that manage independent parts of the hash space). When writing a blob, write it to a temporary file, close it, then rename it, so that a concurrent read gets a consistent result (either not found, or found with the whole content).
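
A minimal sketch of the write-then-rename trick, assuming one file per block named after its hex-encoded hash:

    use std::fs;
    use std::io::Write;
    use std::path::Path;

    fn write_block(dir: &Path, hash_hex: &str, data: &[u8]) -> std::io::Result<()> {
        // Write to a temporary file first...
        let tmp_path = dir.join(format!("{hash_hex}.tmp"));
        let final_path = dir.join(hash_hex);
        let mut tmp = fs::File::create(&tmp_path)?;
        tmp.write_all(data)?;
        tmp.sync_all()?; // make sure the bytes are on disk before the rename
        drop(tmp);
        // ...then rename it into place: a concurrent reader sees either
        // "not found" or the whole block, never a half-written file.
        fs::rename(&tmp_path, &final_path)
    }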

Read strategy: the only read operation is get(hash), which returns either the data or not found (it can also do a corruption check and return a corrupted state if that is the case). It can run concurrently with writes.
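
And a sketch of get(hash) with the corruption check; std's DefaultHasher stands in for the real content hash, purely to keep the example dependency-free:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::path::Path;
    use std::{fs, io};

    enum GetResult {
        Found(Vec<u8>),
        NotFound,
        Corrupted,
    }

    fn content_hash(data: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        data.hash(&mut h);
        h.finish()
    }

    fn get_block(dir: &Path, file_name: &str, expected: u64) -> io::Result<GetResult> {
        match fs::read(dir.join(file_name)) {
            Ok(data) if content_hash(&data) == expected => Ok(GetResult::Found(data)),
            Ok(_) => Ok(GetResult::Corrupted), // on-disk content no longer matches its hash
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(GetResult::NotFound),
            Err(e) => Err(e),
        }
    }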

Internal API:

GC: when the last reference is deleted, delete the block. Long GC procedure: check in Cassandra that the version UUIDs still exist and reference this block.

Rebalancing: takes as argument the list of newly added nodes.

Only one rebalancing can be running at any given time. It can be restarted from the beginning with new parameters.

Membership management

Two sets of nodes:

Thus, three states for nodes:

Membership messages between nodes:

Ring: generated from the desired set of nodes; however, when doing reads/writes on the ring, skip nodes that are known not to be pingable. The tokens are generated deterministically from the node IDs (hash of node ID + token number from 1 to K). The number K of tokens per node is decided by the operator and stored in the operator's list-of-nodes CRDT. Default value proposal: along with the node status information, also broadcast the total disk size and free space, and propose a default number of tokens equal to 80% of the free space / 10 GB. (This is all user interface.)
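
A sketch of the deterministic token generation and of the proposed default K; the hash function and the rounding are arbitrary choices made for illustration:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Tokens for one node: hash(node ID, i) for i in 1..=k.
    fn node_tokens(node_id: &str, k: u32) -> Vec<u64> {
        (1..=k)
            .map(|i| {
                let mut h = DefaultHasher::new();
                node_id.hash(&mut h);
                i.hash(&mut h);
                h.finish()
            })
            .collect()
    }

    /// Proposed default: 80% of the free space, divided into 10 GB shares.
    fn default_token_count(free_space_bytes: u64) -> u32 {
        const GB: u64 = 1024 * 1024 * 1024;
        ((free_space_bytes as f64 * 0.8) / (10 * GB) as f64).round() as u32
    }

    fn main() {
        let free: u64 = 2_000 * 1024 * 1024 * 1024; // a node with ~2 TB free
        let k = default_token_count(free);          // -> 160 tokens
        let tokens = node_tokens("node-1", k);
        println!("node-1 gets {} tokens, first = {:x}", tokens.len(), tokens[0]);
    }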

Constants