Atp's external memory

GlusterFS, NFS, MooseFS


It's long been our plan to move to a better network filesystem than NFS - ideally a distributed one, so that we can get rid of our current NFS+DRBD+Pacemaker storage servers and move to something a bit more integrated and active/active.

Looking at our requirements, the options are: Lustre, Gluster or MooseFS. There are lots of other good candidates out there - Pomegranate, Mogile, Sector/Sphere, Ceph etc. See the Wikipedia page on distributed filesystems for more detail.

Here are our first impressions, and some basic performance data comparisons.

(Updated 29/Jul/2011 - following some very helpful feedback from a gluster architect about some areas that could have been much clearer).

Our use case is bulk data in large files - straightforward filesystem operations, in other words.

Having written bits of a FUSE-based file system I'm a bit cautious of them - probably down to "familiarity breeding contempt" more than anything else. So as both Gluster and MooseFS are FUSE based, I was a little dubious to start with. I'm still of the opinion that an in-kernel filesystem client would probably benefit in terms of throughput, latency and the ability to use the page cache for async writes.

I could write several paragraphs on the filesystems we didn't look at, but of all of them the only one that could be considered in the same categories - "Free", "Production Ready", "In Wide Use", "Not Requiring a Bleeding Edge Kernel" - was Lustre.

Which is now owned by Oracle Corp.

Which is a fairly good reason by itself to avoid making it part of our infrastructure. What sealed it, however, was the lack of data replication/resiliency. True, we have RAID 5, 6 and 10 to choose from. But still, the idea is to get replication for free in the file system.

Test Rig

The initial quick-look test rig is two machines connected by a gigabit cable: "Bar", which has an SSD and acted as the storage server, and "Foo", which was the client. Both have enough RAM for it not to be a bother, and plenty of CPU capacity. For proper replicated testing we're planning on having six desktops with SSDs in.

  • Foo: 4-core Opteron 2214, 12 GB RAM, Ubuntu 10.04 64-bit
  • Bar: 2-core P9400, 4 GB RAM, Fedora 12 64-bit

(Update: SSD stats for running the tests locally are now added below the chart. The underlying file system is EXT4)


GlusterFS

Really easy to install: a deb for the Ubuntu system and RPMs for the Fedora system. It was pretty easy to figure out what did what, and following the instructions on the web site had me up and running on a volume with a single replica within 10 minutes.
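The setup boiled down to a handful of gluster commands. A sketch of the sort of thing involved - the hostnames, volume name and brick paths here are illustrative, not our actual ones:

```shell
# On one node: add the peer, then create a two-brick replicated volume
# ("foo", "bar", "testvol" and the brick paths are hypothetical examples)
gluster peer probe bar
gluster volume create testvol replica 2 transport tcp \
    foo:/export/brick1 bar:/export/brick1
gluster volume start testvol

# On the client: mount it via the FUSE client
mount -t glusterfs foo:/testvol /mnt/gluster
```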

Playing around quickly showed that you had to be careful with the contents of the storage pool directory, and that this was not like an NFS export. It took a while to understand that the "repair" mechanism required a poke at each file from the client. In the end I settled on touch test.file as being the best way forward.
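To poke every file on the volume rather than one at a time, the usual recipe in this generation of Gluster is to walk the mount and stat everything, which forces the lookup that triggers self-heal (the mount point here is an example path):

```shell
# Force a lookup on every file under the glusterfs mount to trigger self-heal
# (/mnt/gluster stands in for wherever the volume is mounted)
find /mnt/gluster -noleaf -print0 | xargs --null stat >/dev/null
```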

Adding storage bricks on the fly to a replicated setup didn't seem possible, but I didn't spend that much time on it; the gluster command seemed quite clear and easy to use. The logfile seemed particularly opaque, and the first time I set it up I managed to get localhost listed as a peer for itself, which shouldn't happen. This caused all sorts of errors in the log file about connections being refused, but after much googling these seemed to be normal.

In short, simple to set up and get running. Once running the only thing I had issues with was the gluster nfs server causing problems with the NFS tests until I shut gluster down completely (obvious in hindsight).

Version Tested: 3.2.1 64bit.

The two brick configuration is a replicated setup, not distributed or striped.  


MooseFS

This felt far more industrial and professional for some reason. Perhaps it's because it matched my preconception of what a distributed filesystem should look like. There are far more moving parts, and unlike Gluster there is a single point of failure in the metadata server. There are moves towards fixing that, it seems. I ran the master, chunkserver and webserver on the same storage node, and the client on the other.

No debs or RPMs. There was a spec file, and with a small edit to the version number and a repack of the .tar.gz file I was able to build RPMs. On the Ubuntu side I ended up doing a traditional configure/make/make install, which irks me a little.

The biggest trip-up with Moose was its handling of deleted files. Moose doesn't delete files right away: it moves them to a "trash" meta folder for a defined quarantine time, and after that has expired it deletes them - slowly. I can see the benefit of this - it would complement our rsnapshot-based timed backups on the fileserver, for example - but during testing I frequently ran out of space even when "df" said there was free space. After reducing the quarantine time to 10 seconds, or deleting the files from the trash directory in the meta folder mount, the space was still not freed up immediately (even though the GUI showed 0 bytes in trash), instead trickling in over the next few minutes. I'm sure there's a way to tune this, but it wasn't immediately evident.
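The knobs for this live in the MooseFS client-side tools. Roughly what I ended up doing - the mount points here are examples, not our real layout:

```shell
# Shorten the trash quarantine on the data mount (value is in seconds)
mfssettrashtime 10 /mnt/mfs
mfsgettrashtime /mnt/mfs

# Mount the meta filesystem to get at the trash directory directly
mkdir -p /mnt/mfsmeta
mfsmount /mnt/mfsmeta -o mfsmeta
ls /mnt/mfsmeta/trash
```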

Also disconcerting was that the total disk space decreased by the size of the deleted files upon deletion. It makes sense, but it's weird to see your disk apparently changing size as you delete files.

Moose "felt" faster than GlusterFS, and has lots of nice features - per-file/directory replica targets, snapshots - none of which I tested. The feeling of speed may be down to the client-side read cache.
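For the record, the replica targets and snapshots are also driven from the client-side tools; something along these lines (the paths are made up for illustration):

```shell
# Ask for 3 copies of everything under this directory, recursively
mfssetgoal -r 3 /mnt/mfs/important
mfsgetgoal /mnt/mfs/important

# Cheap snapshot of a file or tree on the same filesystem
mfsmakesnapshot /mnt/mfs/bigfile /mnt/mfs/bigfile.snap
```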

Version tested: 1.6.20-2


The Tests

Our intended workload is large files, with the emphasis on write performance.

Update: to clarify, all tests were run on a freshly mounted file system to prevent caching effects.


Test 1

dd if=/dev/zero bs=8192 count=131072 of=testcat.img oflag=noatime iflag=fullblock

Test 2

dd if=/dev/zero  bs=8192 count=131072 | cat > testcat.img

(yes, that's rather lazy)

(Update: What's going on here is that the pipe will break the I/O into 4 KB chunks. cat will then buffer that up to some larger block size. In other words, the bs=8192 is only relevant for the read from /dev/zero. What I should have done, to not be lazy, was use ibs=8192 obs=1024k or similar.)
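For reference, the non-lazy version would look something like this - count applies to the input block size, so 131072 x 8192 bytes is still exactly 1 GiB written:

```shell
# Read /dev/zero in 8 KiB blocks, write out in 1 MiB blocks; total 1 GiB
dd if=/dev/zero ibs=8192 obs=1024k count=131072 of=testcat.img iflag=fullblock
```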

Test 3

Copy a file from the file system to another file on the same filesystem (read/write mix)

time cp testcat.img testcat2.img

Test 4

Read a gigabyte from the file system to local ram disk

time cp testcat.img /tmp/test2.img

Test 5

Write a gigabyte from local ram disk to the file system

time cp /tmp/test2.img  testcat2.img

A note about NFS

As these file systems seemed to care about data integrity, it seemed only fair to apply the same rules to NFS. By default NFS comes with the seatbelts off - async I/O. In this case your write will complete, but your data may not have hit the disks at the far end. So we tested with synchronous NFS (mounted -o sync) as well.

NFS was version 3, mounted with default wsize and rsize and noatime. 
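The two mount variants, for completeness - the server name and export path are placeholders:

```shell
# Default (async) behaviour: writes can be acknowledged before hitting disk
mount -t nfs -o vers=3,noatime bar:/export/data /mnt/nfs

# Seatbelts on: synchronous writes
mount -t nfs -o vers=3,noatime,sync bar:/export/data /mnt/nfs
```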


Results

All tests were run with jumbo frames - 9000 bytes. For NFS async in particular, the file system was unmounted to clear the local cache before the read tests.
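The housekeeping between runs amounted to this sort of thing (the interface name and mount point are examples):

```shell
# Jumbo frames on both ends of the gigabit link (eth0 is illustrative)
ip link set dev eth0 mtu 9000

# Remount between tests to drop the client-side cache
umount /mnt/test
mount /mnt/test
```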

  • Unsurprisingly, NFS async wins. The Linux NFS server is not fantastic, but works well enough here. However it's very much an apples-to-lemons comparison, given the caching going on.
  • I was interested to see the actual performance of MooseFS. It's not out ahead, but it's much more consistent, probably due to the client-side caching it uses. It was, however, the slowest on the read test. I have the feeling I've not configured this optimally (see below).
  • GlusterFS really suffers on Test 1. It has the feel of a completely synchronous file system.

(Update: the underlying SSD is capable of the following - Test 1: 125 MiB/s, Test 2: 148 MiB/s, Test 3: 58 MiB/s, Test 4: 94 MiB/s, Test 5: 120 MiB/s. I elected not to put it on the chart as it made the results - particularly for NFS - hard to read. Tests 3 and 4 are the only ones where the network filesystems approach the speed of the disk.)

Network Traffic Plots

I learnt a lot about the performance by looking at the network and CPU plots. Particularly the MooseFS write chart - the engineer in me says "impedance mismatch", and I think I'll need to get to the bottom of what's going on to get the best out of this FS. The pattern is unchanged for jumbo or 1500-byte MTU, so it's got to be a property of something else, perhaps the 64 MB internal block size. The write-behind caching behaviour of the NFS async client clearly stands out here.

1) NFS Sync Copy

2) NFS Async Copy

3) GlusterFS Copy

glusterfs copy

(Update: to make this the same as the other charts - this is a single brick copy (test 3))

4) MooseFS Copy

5) MooseFS Write only


So there you have it - a quick look. The next test is to see how they scale with more than one chunk/brick server, and to see what happens stability-wise when you let a bunch of very talented systems admins loose on them.

Written by atp

Thursday 14 July 2011 at 1:29 pm

Posted in Linux
