(This is not quite a “Dear lazyweb” post. If anything, it's “Dear lazyweb, I've done my homework, now what?”)
I'm looking for the ultimate distributed filesystem. Something that's simple to use, redundant, fault-tolerant, and still smart enough to avoid the more obvious performance chokepoints. Ideally, it should work for all of the following sets of files:
- Music files. They're currently stored on a plug computer at home and served over DAAP, but DAAP seems to be less and less of a priority for music players (I'll keep the “music players suck” rant for another post), and it has its inconveniences too. Ideally, the files should be integrated in the filesystem of my work computer, the one that serves as a media player in the living room, the one I use for playing the backing tracks when I'm practising the drums, and the laptop I use when on the road too.
- Backups. My BackupPC currently runs on the plug computer at home, so backups of my remote servers are in a geographically distinct location; but backups of my home computers are not, and I'd like to change that without having to copy files around by hand. Backups do contain confidential data, so this adds a constraint on the filesystem.
- Bazaar repositories. I do have a script that pushes and pulls stuff across computers for the many repositories I use, but it's still awkward that I need to do that.
- Parts of my home directory, such as my browser preferences and bookmarks, or the database of the Hamster time tracking tool, or the working directories of stuff I do for clients, and so on. Again, each of those can be done with specific means (bookmarks synced to a server, CouchDB-like database, DVCS and regular commits, etc.), but wouldn't it be much simpler if there were only one copy of each file to begin with?
The requirements on the ultimate distributed filesystem (which I'll call UDFS for short, otherwise you'll get bored and go look at pictures of kittens) are as follows:
- Availability means redundancy: some of the storage nodes will be on dedicated servers in datacenters, others at home; I can imagine setting up the firewall so that the home computers are reachable from outside, but sometimes network links go down, and the home computers are far from being on 100% of the time.
- Availability/redundancy also means automated replication and rebalancing: if a node is added to the “grid” (or switched on), it should automatically get its share of the files so as to contribute to availability if another node goes down at a later time.
- Confidentiality: at the very least, network communications must be encrypted and authenticated; ideally, individual storage nodes wouldn't need to be able to access the stored data. If I store bits of my backups on a friend's server, I don't want to have to trust them not to look at the data; also, my friend may actually want to be unable to look at my data (to provide for deniability in case someone else wants to look at it).
- Performance: native disk performance may not be realistically reachable, but the system must be smart enough (or configurable enough) to store files on both sides of the ADSL link, for instance, so not all accesses need to go through the bottleneck.
- Integration with the system: I want a filesystem, not a storage system. All applications know how to navigate a mounted filesystem; very few will interface with a specific application designed to store and fetch chunks of data.
- Scalability would be nice, although my personal needs are still well below the terabyte range and I can't see myself using more than a dozen nodes or so.
Quite a few constraints, eh? I have a feeling they are not mutually incompatible though, so I had a look at several candidates. I started with Wikipedia, and I followed the links to Tahoe-LAFS, XtreemFS and Ceph. The following is my evaluation of these candidates based on much reading of docs and websites and wikis, some questioning on IRC, and very little testing.
- Availability/redundancy 1: all three candidates work on the net, and they all provide for data replication. XtreemFS seems to operate in a “fall-back” mode while the other two are more distributed (meaning there's no canonical node hosting any particular file). Tahoe uses erasure codes (files can be split across N shares, k of which are enough to reconstruct the whole file; the N/k ratio controls the amount of redundancy); it seems to require the “introducer” node to be always up, which introduces a single point of failure, but this node can be replaced with no loss of data if it fails (this just requires reconfiguring the storage nodes, which can probably be automated). Ceph can work with any number of meta-data servers, so redundancy is assured there (data itself can be replicated in a configurable way too).
- Availability/redundancy 2: either I'm a fool and I didn't find out about that, or none of the three candidates provides for automated rebalancing. Apparently Tahoe provides a way to “repair” files that no longer have enough redundancy, but that needs to be run manually on each file, rather than being systematic, and it's a kind of add-on to the system rather than being properly integrated. Ceph only does rebalancing of meta-data. XtreemFS seems to work a bit like RAID1 with spare disks, but details are scarce.
- Confidentiality: Tahoe wins, hands down. Neither network sniffers nor storage nodes can see the contents of the stored files. They can't even find out about their name or the directory structure. XtreemFS encrypts the data on the network (but it is still stored in cleartext on the nodes). Ceph doesn't even try.
- Performance: Ceph wins (or loses less, at any rate): it seems possible to configure a topology for the storage nodes, and to drive the storage location policy according to this topology; so I could define sets of nodes (“my servers in datacenters”, “my plug computer at home”, “my desktop and laptop computers”, “my friends' computers”) and decide that each file should be stored at least once on each set of nodes. Tahoe stores shares of files on a number of storage nodes chosen in a kind of round-robin way; read access uses the same round-robin system, which means that you will probably end up fetching a file at least partially over your slow link even when you could fetch it entirely from the local network. The XtreemFS website doesn't seem to acknowledge the possibility of the network being slow.
- Integration: Ceph and XtreemFS win: they're native filesystems. Tahoe is in the “storage and retrieval system” category; the FUSE layer is marked as experimental and not recommended, and the SFTP layer (which can be mounted as a filesystem with sshfs) has many documented drawbacks. Maybe a WebDAV frontend, combined with fusedav, would provide a working alternative, but it's not implemented yet.
- Scalability: all three candidates boast about scalability (in data size and in number of nodes). Ceph seems to require some configuration every time a new storage node is added. Tahoe seems much more dynamic: new nodes (storage or clients) just need to be told where the introducer node is, and then they merge into the grid seamlessly by being told about other nodes (a bit like BitTorrent, where clients learn about each other by asking the tracker).
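To put numbers on the erasure-coding trade-off mentioned above: with Tahoe's default 3-of-10 encoding (k=3 shares needed out of N=10 stored, roughly 3.3× storage expansion), a file stays readable as long as any 3 of its 10 shares are reachable. Here is a small Python sketch of my own (not Tahoe code) that computes availability as a function of node uptime, under the simplifying assumptions that shares land on distinct nodes and nodes fail independently:

```python
from math import comb

def survival_probability(n, k, p_up):
    """Probability that at least k of n shares are reachable, when each
    share sits on a distinct node that is up with probability p_up."""
    return sum(comb(n, i) * p_up**i * (1 - p_up)**(n - i)
               for i in range(k, n + 1))

# Tahoe's defaults: k=3 needed, n=10 total.
for p in (0.8, 0.9, 0.99):
    print(f"node uptime {p:.0%}: "
          f"file available {survival_probability(10, 3, p):.6f}")
```

Under this (admittedly naive) model, even nodes that are only up 80% of the time keep the file available better than 99.99% of the time, which is exactly why the N/k ratio matters more than the raw number of copies.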
The overall picture seems full of good things scattered across different solutions, but unfortunately none of the existing ones seems to address the whole problem; at least, not my whole problem. It would be good if each focused on one layer and did that layer well, but that seems not to be the case either, so they can't be combined to get the best of all worlds. It may be that I'm missing something, or that I failed to read some docs properly, or that I misunderstood the docs, or that the docs themselves are simply lacking; but my ideal UDFS currently doesn't seem to exist as a turn-key solution.
However, the main pieces are available, and implementing the remaining parts may be doable. My humble idea of a way forward would be based on Tahoe-LAFS, with the following three changes:
- A configurable dispatch policy, so that administrators could define their own behaviour. The current Tahoe parts already turn files into a number of blobs that only need to be stored on various nodes, and the function that chooses which nodes seems to be reasonably self-contained, so it could maybe even be made pluggable, and an administrator could implement a storage policy that matches the constraints and the network topology. Ditto for the function that picks which nodes the data is read from (which, as far as I can tell, is currently the same function; splitting the two could bring some benefits here).
- A working FUSE implementation, either on top of SFTP if the drawbacks can be fixed, on top of WebDAV if that gets implemented, or native to Tahoe if that can be done.
- Automatic rebalancing of data when new nodes are added or turned on, or when an intermittent network link comes up. The current repair tool seems to be an afterthought; if it can be made more automated and more reliable, I'll be happy with that.
Also nice to have would be a way to work with multiple introducer nodes, but that seems to be in the works already. This would be pretty damn close to my UDFS; read/write performance would certainly be far from what can be obtained on native filesystems stored on local disks, but my use cases involve reasonably small files for which instant access is not compulsory, and the filesystem cache would probably absorb most of the access times.
In case anyone is looking for ideas of things to do in their spare time, here are rough sketches of other possible UDFS implementations I thought of. These are wild ideas, and I'm not even sure they would be workable in practice:
- eCryptfs on top of Ceph. The file contents are still only accessible to those in possession of the adequate key, and replication/distribution is handled by Ceph. That could probably work; the main drawbacks I can see come from Ceph's administrative overhead and manual configuration.
- LUKS and a standard filesystem, on top of Ceph's block device layer (RBD, which sits on top of RADOS). I'm not sure this could be mounted simultaneously by several nodes, though.
- LUKS on top of Ceph's RBD, and Lustre or GlusterFS on top of that? Am I even making sense?
- A kind of union filesystem with local caching, on top of Tahoe-LAFS. Say we take unionfs-fuse, and add local caching: if the requested file is already in cache, performance is near native; if not, we still get the advantages of Tahoe. I'm not sure how caching works with WebDAV, but once Tahoe gets a WebDAV frontend it might be a simple matter of adding a Squid cache between that and davfs.
- I even thought of building a filesystem on top of the BitTorrent protocol; redundancy would be obtained by ensuring that there are at least N “distributed copies” of a file among the peers. I gave up even before reaching the details phase, but maybe a variant of the idea could be workable.
Such is the state of my research so far. I would welcome feedback, pointers to things I neglected to read, corrections for things I misread or misunderstood, comments on the ideas and so on. I'll probably post an update if my search goes significantly forward.
Update: I've already received two pieces of feedback, including a lengthy one with corrections about Tahoe-LAFS. For the sake of fairness, I'll solicit (and wait for) the same from the other candidates I looked at. I was also pointed at HekaFS, GlusterFS and git-annex, which I'll have to look at in more detail. Other suggestions are still welcome, but the more I get, the more the full update will be delayed. Thanks already!
Update 2: See take 2 for the full update.