Comments Page - JuiceFS is a distributed POSIX file system built on top of Redis and S3

« Back JuiceFS is a distributed POSIX file system built on top of Redis and S3github.comSubmitted by tosh 3 days ago

wgjordan 3 days ago
Related, "The Design & Implementation of Sprites" [1] (also currently on the front page) mentioned JuiceFS in its stack:
> The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data (“chunks”) and metadata (a map of where the “chunks” are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage.
[1] https://news.ycombinator.com/item?id=46634450
staticassertion 3 days ago
Do people really trust Redis for something like this? I feel like it's sort of pointless to pair Redis with S3 like this, and it'd be better to see benchmarks with metadata stores that can provide actual guarantees for durability/availability.
Unfortunately, the benchmarks use Redis. Why would I care about distributed storage on a system like S3, which is all about consistency/durability/availability guarantees, just to put my metadata into Redis?
It would be nice to see benchmarks with another metadata store.
- onionjake 3 days ago
  We developed Object Mount (formerly cunoFS) (https://www.storj.io/object-mount?hn=1) specifically to not rely on any metadata storage other than S3 AND preserve 1:1 mapping of objects to files AND support for POSIX. We have a direct mode that uses LD_PRELOAD to keep everything in userspace so no FUSE overhead.
  This approach isn't right for every use case and juice might be a better fit for this sort of 'direct block store', but wanted to include it here for folks that might want something like Juice but without having to maintain a metadata store.
  (Disclosure: I work at Storj that develops Object Mount)
  joshstrange 2 days ago
  I am currently looking for a way to take a legacy application that uses the filesystem as it's database and needs to support locking (flock) on FreeBSD and scale it horizontally (right now we only scale vertically and rebuilding from a corrupted FS and/or re-pulling our, backup, data from S3 takes too long if we lose a machine). We investigated NFS but the FreeBSD NFS performance was 1/10th or worse than on Linux and switching to Linux is not in the cards for us right now.
  Does Object Mount support file locks (flock specifically) and FreeBSD? I see some mention of FreeBSD but I can't find anything on locking.
  For context, we are working with a large number of small (<10KB if not <4KB) files normally.
  hnarn a day ago
  According to this link:
  https://vonng.com/en/pg/pgfs/
  > The magic is: JuiceFS also supports using PostgreSQL as both metadata and object data storage backend! This means you only need to change JuiceFS’s backend to an existing PostgreSQL instance to get a database-based “filesystem.”
  Sounds ideal for your kind of situation where a filesystem is being abused as a database :-)
  onionjake a day ago
  No BSD support unfortunately. Just Linux, Mac, windows.
  lifty 2 days ago
  Is this open source? I am a happy Storj customer, would love to use it if it's open source.
  onionjake a day ago
  Not open source. What would be your use case?
  ridruejo 2 days ago
  Pretty cool
- suavesu 2 days ago
  JuiceFS metadata engine comparison -> https://juicefs.com/docs/community/metadata_engines_benchmar...
  staticassertion 2 days ago
  Thanks, good to see these. Looks like these only show metadata operations though. But I guess that's probably enough to extrapolate that the system is going to be roughly 2-4x slower with other metadata stores.
- bastawhiz 3 days ago
  Redis is as reliable as the storage you persist it to. If you're running Redis right, it's very reliable. Not S3 reliable, though. But if you need S3 reliable, you would turn to something else.
  I expect that most folks looking at this are doing it because it means:
  1. Effectively unbounded storage
  2. It's fast
  3. It's pretty darn cheap
  4. You can scale it horizontally in a way that's challenging to scale other filesystems
  5. All the components are pretty easy to set up. Many folks are probably already running S3 and Redis.
  danpalmer 3 days ago
  Redis isn't durable unless you drastically reduce the performance.
  Filesystems are pretty much by definition durable.
  bastawhiz 2 days ago
  Enabling the WAL doesn't make Redis slow. It's slower than the default, but it's still exceptionally fast.
  > Filesystems are pretty much by definition durable.
  Where do you think Redis persists its data to
  staticassertion 3 days ago
  > Redis is as reliable as the storage you persist it to.
  For a single node, if you tank performance by changing the configuration, sure. Otherwise, no, not really.
  I don't get why you'd want a file system that isn't durable, but to each their own.
  bastawhiz 2 days ago
  With the wal redis runs with perfectly reasonable performance. Of course you're not going to have the performance of an in-memory only DB if you're flushing to disk on every write.
  There's no vacuuming, there's no need for indexing. You can see the time complexity of most operations. Key-value operations are mostly O(1). You'll never get that kind of performance with other databases because they intentionally don't give you that granularity.
  The metadata of the filesystem isn't the performance bottleneck in most cases.
  staticassertion a day ago
  > With the wal redis runs with perfectly reasonable performance. Of course you're not going to have the performance of an in-memory only DB if you're flushing to disk on every write.
  But that's not what gets benchmarked against.
  > You'll never get that kind of performance with other databases because they intentionally don't give you that granularity.
  Okay but you also won't get horizontal scaling while maintaining consistency in Redis.
  > The metadata of the filesystem isn't the performance bottleneck in most cases.
  Okay, but they do show that other metadata stores are 2-4x slower than Redis, so it would be great to see whole-system benchmarks that use those.
  doctorpangloss 3 days ago
  > 4. You can scale it horizontally in a way that's challenging to scale other filesystems
  Easy to scale on RDS, along with everything else. But there’s no Kubernetes operator. Is there a better measure of “easy” or “challenging?” IMO no. Perhaps I am spoiled by CNPG.
- tuhgdetzhh 3 days ago
  I think they should replace Redis with Valkey or even better use rocksdb.
  fpoling 2 days ago
  But how would RocksDB work with S3? It needs support for append that generic S3 buckets do not provide and for checkpoints and backup it assumes the support for hardlinks that S3 does not have at all.
  tuhgdetzhh 2 days ago
  SeaweedFS has a RocksDB metadata backend for instance.
  alexhornby 2 days ago
  [dead]
- ycombinatrix 3 days ago
  It says MySQL can be used instead of Redis for the metadata
  suavesu 2 days ago
  yes, support more 10 options, include redis, SQL-like DB, TiKV, FoundationDB, and more. see here -> https://juicefs.com/docs/community/databases_for_metadata
willbeddow 3 days ago
Juice is cool, but tradeoffs around which metadata store you choose end up being very important. It also writes files in it's own uninterpretable format to object storage, so if you lose the metadata store, you lose your data.
When we tried it at Krea we ended up moving on because we couldn't get sufficient performance to train on, and having to choose which datacenter to deploy our metadata store on essentially forced us to only use it one location at a time.
- tptacek 3 days ago
  I'm betting this is on the front page today (as opposed to any other day; Juice is very neat and doesn't need us to hype it) because of our Sprites post, which goes into some detail about how we use Juice (for the time being; I'm not sure if we'll keep it this way).
  The TL;DR relevant to your comment is: we tore out a lot of the metadata stuff, and our metadata storage is SQLite + Litestream.io, which gives us fast local read/write, enough systemwide atomicity (all atomicity in our setting runs asymptotically against "someone could just cut the power at any moment"), and preserves "durably stored to object storage".
  ghm2199 3 days ago
  Litestream.io is amazing. Using sqlite as a DB in a typical relational data model where objects are related mean most read then write transactions would have to one node, but if the using it for blobs as first class objects(e.g. video uploads or sensor data) which are independent probably means you can shard and scale your set up the wazoo right?
- c4pt0r 2 days ago
  In large-scale metadata scenarios, JFS recommends using a distributed key-value store to host metadata, such as TiKV or FoundationDB. Based on my experience with large JFS users, most of them choose TiKV.
  Disclaimer: I'm the co-founder of TiKV.
  suavesu 2 days ago
  Truth. It supports very large volume, more than 100PiB and 10B files in a single volume.
  Disclosure: I'm co-founder of JuiceFS
- AdamJacobMuller 3 days ago
  > It also writes files in it's own uninterpretable format to object storage, so if you lose the metadata store, you lose your data.
  That's so confusing to me I had to read it five times. Are you saying you lose the metadata, or that the underlying data is actually mangled or gone, or merely that you lose the metadata?
  One of the greatest features of something like this to me would be the ability to durable even beyond JuiceFS access to my data in a bad situation. Even if JuiceFS totally messes up, my data is still in S3 (and with versioning etc even if juicefs mangles or deletes my data, still). So odd to design this kind of software and lose this property.
  mrkurt 3 days ago
  It backs its metadata up to S3. You do need metadata to map inodes / slices / chunks to s3 objects, though.
  Tigris has a one-to-one FUSE that does what you want: https://github.com/tigrisdata/tigrisfs
  ifoxhz 3 days ago
  FUSE generally has low overall performance because of an additional data transfer process between the kernel space and user space, which is less than ideal for AI training.
  suavesu 2 days ago
  The MLPerf performance of JuiceFS last year, FUSE client, and TCP/IP network. There're some analysis about performance and bottleneck. -> https://juicefs.com/en/blog/engineering/mlperf-storage-v2-ai...
  cbarrick 3 days ago
  As I understand it, if the metadata is lost then the whole filesystem is lost.
  I think this is a common failure mode in filesystems. For example, in ZFS, if you store your metadata on a separate device and that device is destroyed, the whole pool is useless.
  suavesu 2 days ago
  metadata backup is very important. don't forget.
  undefined 3 days ago
  [deleted]
jeffbee 3 days ago
It is not clear that pjdfstest establishes full POSIX semantic compliance. After a short search of the repo I did not see anything that exercises multiple unrelated processes atomically writing with O_APPEND, for example. And the fact that their graphic shows applications interfacing with JuiceFS over NFS and SMB casts further doubt, since both of those lack many POSIX semantic properties.
Over the decades I have written test harnesses for many distributed filesystems and the only one that seemed to actually offer POSIX semantics was LustreFS, which, for related reasons, is also an operability nightmare.
- suavesu 2 days ago
  pjdtest result of JuiceFS -> https://juicefs.com/en/blog/engineering/posix-compatibility-...
adamcharnock 2 days ago
We've been using JuiceFS in production for a few months now and I'm a big fan. I've felt for a while that block-level filesystems do not adapt at all well to being implemented across a network (with my personal experience being of AWS EBS and OpenEBS Mayastor). So the fact that JuiceFS is interfaces at the POSIX layer felt intuitively better to me.
I also like that it can keep a local read-cache, rather than having to hit object storage for every read. This is because it can perform a freshness check with with the (relatively fast) metadata store to determine if its cached data is valid, prior to serving the request from cache.
We back it with a 3-node (Redis-compatible) HA Valkey cluster, and in-cluster MinIO object storage, all in bare-metal Kubernetes. We can saturate a 25g NIC with (IIRC) 16+ concurrent users.
It is also one of the few Kubernetes storage providers that provides read-write-many (RWX) access. Which can also be rather helpful in some situations.
In an early test we were running it against MinIO with zero redundancy. Which is not recommended in any case. There we did see some file corruption creep in. In which case some files in JuiceFS become unreadable, but the system as a whole kept working.
Another reason I think JuiceFS works well, is indeed because of its custom block-based storage format. It is disconcerting because you cannot see your files in object storage, but instead just a lot of chunks. But this does buy some real performance benefits, especially when doing partial file reads or updates.
Another test we're doing is running a small-to-medium sized Prometheus persisted to JuiceFS. It hasn't shown any issues so far.
And, if you've made it this, far: check us out if you want a hand installing and operating this kind of infra: https://lithus.eu . We deploy to bare-metal Hetzner.
- hn92726819 2 days ago
  > There we did see some file corruption creep in
  Did you figure out what caused corruption? Was minio losing blocks or was juicefs corrupted even though minio was consistent?
  adamcharnock 2 days ago
  It was definitely MinIO related, I probably should have made that clearer. We noticed that with zero fault tolerance, MinIO objects would randomly become corrupted, which MinIO would present as "you're making too many requests, please slow down". We were certainly not making too many requests.
  maxloh 2 days ago
  What is your plan after MinIO enters maintenance mode?
  adamcharnock 2 days ago
  We're looking at alternatives, I've made some previous comments on that front. Sadly MinIO was the only option with sufficient performance for this particular situation. Thankfully we're not using any MinIO-specific features, so at least the migration path away is clear.
  fakebizprez a day ago
  Ceph. The answer is always Ceph.
eerikkivistik 2 days ago
I've had to test out various networked filesystems this year for a few use cases (satellite/geo) on a multi petabyte scale. Some of my thoughts:
* JuiceFS - Works well, for high performance it has limited use cases where privacy concerns matter. There is the open source version, which is slower. The metadata backend selection really matters if you are tuning for latency.
* Lustre - Heavily optimised for latency. Gets very expensive if you need more bandwidth, as it is tiered and tied to volume sizes. Managed solutions available pretty much everywhere.
* EFS - Surprisingly good these days, still insanely expensive. Useful for small amounts of data (few terabytes).
* FlexFS - An interesting beast. It murders on bandwidth/cost. But slightly loses on latency sensitive operations. Great if you have petabyte scale data and need to parallel process it. But struggles when you have tooling that does many small unbuffered writes.
- mickael-kerjean 2 days ago
  Nothing around content addressable storage? Has anyone used something like IPFS / Kubo in production at that kind of scale?
  (for those who don't know IPFS, I find the original paper fascinating: https://arxiv.org/pdf/1407.3561)
  eerikkivistik 2 days ago
  The latency and bandwidth really isn't there for HPC.
- romantomjak 2 days ago
  Did you happen to look into CephFS? CERN (folks that operate Large Hadron Collider) use it to store ~30PB of scientific data. Their analysis cluster is serving ~30GB/s reads
  eerikkivistik 2 days ago
  Sure, so the use case I have requires elastic storage and elastic compute. So CephFS really isn't a good fit in the cloud environment for that case. It would get prohibitively expensive.
tuhgdetzhh 3 days ago
If tested various Posix FS projects over the years and everyone has their shortcomings in one way or the other.
Although the maintainers of these projects disagree, I mostly consider them as a workaround for smaller projects. For big data (PB range) and critical production workloads I recommend to bite the bullet and make your software nativley S3 compatible without going over a POSIX mounted S3 proxy.
- daviesliu 3 days ago
  JuiceFS can be scaled to hundreds of PB by design, also is verified by thousands of users in production [1].
  [1] https://juicefs.com/en/blog/company/2025-recap-artificial-in...
- suavesu 2 days ago
  Agree, don't recommend POSIX proxy on S3 for complex workload, like S3FS. In the design of JuiceFS, S3 is like raw disk, JuiceFS metadata engine is like partition table, compare with local file system.
- ifoxhz 3 days ago
  I think so.
mattbillenstein 3 days ago
The key I think with s3 is using it mostly as a blobstore. We put the important metadata we want into postgres so we can quickly select stuff that needs to be updated based on other things being newer. So, we don't need to touch s3 that often if we don't need the actual data.
When we actually need to manipulate or generate something in Python, we download/upload to S3 and wrap it all in a tempfile.TemporaryDirectory() to cleanup the local disk when we're done. If you don't do this, you end up with a bunch of garbage eventually in /tmp/ you need to deal with.
We also have some longer-lived disk caches and using the data in the db and a os.stat() on the file we can easily know if the cache is up to date without hitting s3. And this cache, we can just delete stuff that's old wrt os.stat() to manage the size of it since we can always get it from s3 again if needed in the future.
IshKebab 3 days ago
Interesting. Would this be suitable as a replacement for NFS? In my experience literally everyone in the silicon design industry uses NFS on their compute grid and it sucks in numerous ways:
* poor locking support (this sounds like it works better)
* it's slow
* no manual fence support; a bad but common way of distributing workloads is e.g. to compile a test on one machine (on an NFS mount), and then use SLURM or SGE to run the test on other machines. You use NFS to let the other machines access the data... and this works... except that you either have to disable write caches or have horrible hacks to make the output of the first machine visible to the others. What you really want is a manual fence: "make all changes to this directory visible on the server"
* The bloody .nfs000000 files. I think this might be fixed by NFSv4 but it seems like nobody actually uses that. (Not helped by the fact that CentOS 7 is considered "modern" to EDA people.)
- mrkurt 3 days ago
  FUSE is full of gotchas. I wouldn't replace NFS with JuiceFS for arbitrary workloads. Getting the full FUSE set implemented is not easy -- you can't use sqlite on JuiceFS, for example.
  The meta store is a bottleneck too. For a shared mount, you've got a bunch of clients sharing a metadata store that lives in the cloud somewhere. They do a lot of aggressive metadata caching. It's still surprisingly slow at times.
  huntaub 3 days ago
  > FUSE is full of gotchas
  I want to go ahead and nominate this for the understatement of the year. I expect that 2026 is going to be filled with people finding this out the hard way as they pivot towards FUSE for agents.
  dpe82 3 days ago
  Mind helping us all out ahead of time by expanding on what kind of gotchas FUSE is full of?
  huntaub 3 days ago
  It depends on what level of FUSE you're working with.
  If you're running a FUSE adapter provided by a third party (Mountpoint, GCS FUSE), odds are that you aren't going to get great performance because it's going to have to run across a network super far away to work with your data. To improve performance, these adapters need to be sure to set fiddly settings (like using Kernel-side writeback caching) to avoid the penalty of hitting the disk for operations like write.
  If you're trying to write a FUSE adapter, it's up to you to implement as much of the POSIX spec that you need for the programs that you want to run. The requirements per-program are often surprising. Want to run "git clone", then you need to support the ability to unlink a file from the file system and keep its data around. Want to run "vim", you need the ability to do renames and hard links. All of this work needs to happen in-memory in order to get the performance that applications expect from their file system, which often isn't how these things are built.
  Regarding agents in particular, I'm hopeful that someone (which is quite possibly us), builds a FUSE-as-a-service primitive that's simple enough to use that the vast majority of developers don't have to worry about these things.
  IshKebab 3 days ago
  > you need to support the ability to unlink a file from the file system and keep its data around. Want to run "vim", you need the ability to do renames and hard links
  Those seem like pretty basic POSIX filesystem features to be fair. Awkward, sure... there's also awkwardness like symlinks, file locking, sticky bits and so on. But these are just things you have to implement. Are there gotchas that are inherent to FUSE itself rather than FUSE implementations?
  huntaub 3 days ago
  These are basic POSIX features, but I think the high-level point that Kurt is trying to make is that building a FUSE file system signs you up for a nearly unlimited amount of compatibility work (if you want to support most applications) whereas their approach (just do a loopback ext4 fs into a large file) avoids a lot of those problems.
  My expectations are that in 2026 we will see more and more developers attempt to build custom FUSE file systems and then run into the long tail of compatibility pain.
  IshKebab 3 days ago
  > just do a loopback ext4 fs into a large file
  How does that work with multiple clients though?
  huntaub 3 days ago
  tl;dr it doesn't. I'm not sure what they're planning in this capacity (I haven't checked out sprites myself), but I would guess that it's going to be a function of "snapshots" as a mechanism to give multiple clients ephemeral write access to the same disk.
- jabl 3 days ago
  > poor locking support (this sounds like it works better)
  File locking on Unix is in general a clusterf*ck. (There was a thread a few days ago at https://news.ycombinator.com/item?id=46542247 )
  > no manual fence support; a bad but common way of distributing workloads is e.g. to compile a test on one machine (on an NFS mount), and then use SLURM or SGE to run the test on other machines. You use NFS to let the other machines access the data... and this works... except that you either have to disable write caches or have horrible hacks to make the output of the first machine visible to the others. What you really want is a manual fence: "make all changes to this directory visible on the server"
  In general, file systems make for poor IPC implementations. But if you need to do it with NFS, the key is to understand the close-to-open consistency model NFS uses, see section 10.3.1 in https://www.rfc-editor.org/rfc/rfc7530#section-10.3 . Of course, you'll also want some mechanism for the writer to notify the reader that it's finished, be it with file locks, or some other entirely different protocol to send signals over the network.
  IshKebab 3 days ago
  > In general, file systems make for poor IPC implementations.
  I agree but also they do have advantages such as simplicity, not needing to explicitly declare which files are needed, lazy data transfer, etc.
  > you'll also want some mechanism for the writer to notify the reader that it's finished, be it with file locks, or some other entirely different protocol to send signals over the network.
  The writer is always finished before the reader starts in these scenarios. The issue is reads on one machine aren't guaranteed to be ordered after writes on a different machine due to write caching.
  It's exactly the same problem as trying to do multithreaded code. Thread A writes a value, thread B reads it. But even if they happen sequentially in real time thread B can still read an old value unless you have an explicit fence.
  jabl 2 days ago
  > The writer is always finished before the reader starts in these scenarios. The issue is reads on one machine aren't guaranteed to be ordered after writes on a different machine due to write caching.
  In such a case it should be sufficient to rely on NFS close-to-open consistency as explained in the RFC I linked to in the previous message. Closing a file forces a flush of any dirty data to the server, and opening a file forces a revalidation of any cached content.
  If that doesn't work, your NFS is broken. ;-)
  And if you need 'proper' cache coherency, something like Lustre is an option.
  IshKebab 2 days ago
  It wasn't my job so I didn't look into this fully, but the main issue we had was clients claiming that files didn't exist when they did. I just reread the NFS man page and I guess this is the issue:
  > To detect when directory entries have been added or removed on the server, the Linux NFS client watches a directory's mtime. If the client detects a change in a directory's mtime, the client drops all cached LOOKUP results for that directory. Since the directory's mtime is a cached attribute, it may take some time before a client notices it has changed. See the descriptions of the acdirmin, acdirmax, and noac mount options for more information about how long a directory's mtime is cached.
  > Caching directory entries improves the performance of applications that do not share files with applications on other clients. Using cached information about directories can interfere with applications that run concurrently on multiple clients and need to detect the creation or removal of files quickly, however. The lookupcache mount option allows some tuning of directory entry caching behavior.
  People did talk about using Lustre or GPFS but apparently they are really complex to set up and maybe need fancier networking than ethernet, I don't remember.
  lstodd 2 days ago
  I did set up GPFS tadam... almost exactly 20 years ago. I wouldn't say it absolutely required fancy networking (infiniband) or was extraordinary complex to set up, certainly on par with NFS when you hit its quirks (which was the reason we went off experimenting with gpfs and whatnot).
- huntaub 3 days ago
  > * The bloody .nfs000000 files. I think this might be fixed by NFSv4 but it seems like nobody actually uses that. (Not helped by the fact that CentOS 7 is considered "modern" to EDA people.)
  Unfortunately, NFSv4 also has the silly rename semantics...
  jabl 3 days ago
  AFAIU the NFSv4 protocol in principle allows implementing unlinking an open file without silly rename, but the Linux client still does the silly rename dance.
- xorcist 3 days ago
  > NFSv4 but it seems like nobody actually uses that
  Hurry up and you might be able to adopt it before its 30th birthday!
- ekropotin 3 days ago
  How about CephFs?
weinzierl 2 days ago
The consistency guarantees are what makes this interesting in my opinion.
> * Close-to-open consistency. Once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client. Within the same mount point, all the written data can be read immediately.*
> Rename and all other metadata operations are atomic, which are guaranteed by supported metadata engine transaction.
This is a lot more than other "POSIX compatible" overlays claim, and I think similar to what NFSv4 promises. There are lots of subtitles there, though, and I doubt you could safely run a database on it.
- suavesu 2 days ago
  Can run MySQL and PG, but don't recommend, not good performance for production. but for temporary it's OK. Here's a case study -> https://juicefs.com/en/blog/user-stories/xiachufang-mysql-ba...
  And here's a POSIX compatibility comparison with other cloud file system, like AWS EFS. https://juicefs.com/en/blog/engineering/posix-compatibility-...
hsn915 3 days ago
This is upside down.
We need a kernel native distributed file system so that we can build distributed storage/databases on top of it.
This is like building an operating system on top of a browser.
- sroerick 2 days ago
  I'm glad you noticed this, I thought this was a wildly insane thing to do. Its like the satanic inversion of 9P protocol
- satoru42 3 days ago
  Show me an operating system built on top of a browser that can be used to solve real-world problems like JuiceFS.
  seany 3 days ago
  https://github.com/tractordev/apptron ?
  sroerick 2 days ago
  Why are you building an operating system on top of a browser?
  hsn915 2 days ago
  My criticism is of the basic architecture, not usability or fitness for a particular purpose.
  If a distributed file system is useful, then a properly architectured one is 100x more useful and more performant.
  ycombinatrix 3 days ago
  OrbitDB?
sabslikesobs 3 days ago
See also their User Stories: https://juicefs.com/en/blog/user-stories
I'm not an enterprise-storage guy (just sqlite on a local volume for me so far!) so those really helped de-abstractify what JuiceFS is for.
Plasmoid 3 days ago
I was actually looking at using this to replace our mongo disks so we could easily cold store our data
Eikon 3 days ago
ZeroFS [0] outperforms JuiceFS on common small file workloads [1] while only requiring S3 and no 3rd party database.
[0] https://github.com/Barre/ZeroFS
[1] https://www.zerofs.net/zerofs-vs-juicefs
- huntaub 3 days ago
  Respect to your work on ZeroFS, but I find it kind of off-putting for you to come in and immediately put down JuiceFS, especially with benchmark results that don't make a ton of sense, and are likely making apples-to-oranges comparisons with how JuiceFS works or mount options.
  For example, it doesn't really make sense that "92% of data modification operations" would fail on JuiceFS, which makes me question a lot of the methodology in these tests.
  selfhoster1312 3 days ago
  I have very limited experiences with object storage, but my humble benchmarks with juicefs + minio/garage [1] showed very bad performance (i.e. total collapse within a few hours) when running lots of small operations (torrents).
  I wouldn't be surprised if there's a lot of tuning that can be achieved, but after days of reading docs and experimenting with different settings i just assumed JuiceFS was a very bad fit for archives shared through Bittorrent. I hope to be proven wrong, but in the meantime i'm very glad zerofs was mentioned as an alternative for small files/operations. I'll try to find the time to benchmark it too.
  [1] https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1021
  Eikon 3 days ago
  > but I find it kind of off-putting for you to come in and immediately put down JuiceFS, especially with benchmark results that don't make a ton of sense, and are likely making apples-to-oranges comparisons with how JuiceFS works or mount options.
  The benchmark suite is trivial and opensource [1].
  Is performing benchmarks “putting down” these days?
  If you believe that the benchmarks are unfair to juicefs for a reason or for another, please put up a PR with a better methodology or corrected numbers. I’d happily merge it.
  EDIT: From your profile, it seems like you are running a VC backed competitor, would be fair to mention that…
  [1] https://github.com/Barre/ZeroFS/tree/main/bench
  wgjordan 3 days ago
  > The benchmark suite is trivial and opensource.
  The actual code being benchmarked is trivial and open-source, but I don't see the actual JuiceFS setup anywhere in the ZeroFS repository. This means the self-published results don't seem to be reproducible by anyone looking to externally validate the stated claims in more detail. Given the very large performance differences, I have a hard time believing it's an actual apples-to-apples production-quality setup. It seems much more likely that some simple tuning is needed to make them more comparable, in which case the takeaway may be that JuiceFS may have more fiddly configuration without well-rounded defaults, not that it's actually hundreds of times slower when properly tuned for the workload.
  (That said, I'd love to be wrong and confidently discover that ZeroFS is indeed that much faster!)
  huntaub 3 days ago
  Yes, I'm working in the space too. I think it's fine to do benchmarks, I don't think it's necessary to immediately post them any time a competitor comes up on HN.
  I don't want to see the cloud storage sector turn as bitter as the cloud database sector.
  I've previously looked through the benchmarking code, and I still have some serious concerns about the way that you're presenting things on your page.
  zaphirplane 3 days ago
  > presenting things
  I don’t have a dog in this race, have to say thou the vagueness of the hand waving in multiple comments is losing you credibility
  eYrKEC2 3 days ago
  I'm always curious about the of the option space. I appreciate folks talking about the alternative s. What's yours?
  huntaub 3 days ago
  Our product is Archil [1], and we are building our service on top of a durable, distributed SSD storage layer. As a result, we have the ability to: (a) store and use data in S3 in its native format [not a block based format like the other solutions in this thread], (b) durably commit writes to our storage layer with lower latency than products which operate as installable OSS libraries and communicate with S3 directly, and (c) handle multiple writers from different instances like NFS.
  Our team spent years working on NFS+Lustre products at Amazon (EFS and FSx for Lustre), so we understand the performance problems that these storage products have traditionally had.
  We've built a custom protocol that allows our users to achieve high-performance for small file operations (git -- perfect for coding agents) and highly-parallel HPC workloads (model training, inference).
  Obviously, there are tons of storage products because everyone makes different tradeoffs around durability, file size optimizations, etc. We're excited to have an approach that we think can flex around these properties dynamically, while providing best-in-class performance when compared to "true" storage systems like VAST, Weka, and Pure.
  [1] https://archil.com
- Dylan16807 3 days ago
  > ZeroFS supports running multiple instances on the same storage backend: one read-write instance and multiple read-only instances.
  Well that's a big limiting factor that needs to be at the front in any distributed filesystem comparison.
  Though I'm confused, the page says things like "ZeroFS makes S3 behave like a regular block device", but in that case how do read-only instances mount it without constantly getting their state corrupted out from under them? Is that implicitly talking about the NBD access, and the other access modes have logic to handle that?
  Edit: What I want to see is a ZeroFS versus s3backer comparison.
  Edit 2: changed the question at the end
- ChocolateGod 3 days ago
  Let's remember that JuiceFS can be setup very easily to not have a single point of failure (by replicating the metadata engine), meanwhile ZeroFS seems to have exactly that.
  If I was a company I know which one I'd prefer.
  __turbobrew__ 2 days ago
  Yea, that is a big caveat to ZeroFS. Single point of failure. It is like saying I can write a faster etcd by only having a single node. Sure, that is possible, but the hard part of distributed systems is the coordination, and coordination always makes performance worse.
  I personally have went with Ceph for distributed storage. I personally have a lot more confidence in Ceph over JuiceFS and ZeroFS, but realize building and running a ceph cluster is more complex, but with that complexity you get much cheaper S3, block storage, and cephfs.
  ChocolateGod 2 days ago
  I replaced a GlusterFS cluster with JuiceFS some years ago and it's been a relief. Just much easier to manage.
  suavesu 2 days ago
  Some users use JuiceFS with CephFS RADOS, as alternative with Ceph MDS.
- dpacmittal 3 days ago
  The magnitude of performance difference alone immediately makes me skeptical of your benchmarking methodology.
  selfhoster1312 3 days ago
  I'm not an expert in any way, but i personally benchmarked [1] juiceFS performance totalling collapsing under very small files/operations (torrenting). It's good to be skeptical, but it might just be that the bar is very low for this specific usecase (IIRC juiceFS was configured and optimized for block sizes of several MBs).
  https://git.deuxfleurs.fr/Deuxfleurs/garage/issues/1021
- wgjordan 3 days ago
  For a proper comparison, also significant to note that JuiceFS is Apache-2.0 licensed while ZeroFS is dual AGPL-3.0/commercial licensed, significantly limiting the latter's ability to be easily adopted outside of open source projects.
  anonymousDan 3 days ago
  Why would this matter if you're just using the database?
  Eikon 3 days ago
  It doesn’t, you are free to use ZeroFS for commercial and closed source products.
  wgjordan 3 days ago
  This clarification is helpful, thanks! The README currently implies a slightly different take, perhaps it could be made more clear that it's suitable for use unmodified in closed source products:
  > The AGPL license is suitable for open source projects, while commercial licenses are available for organizations requiring different terms.
  I was a bit unclear on where the AGPL's network-interaction clause draws its boundaries- so the commercial license would only be needed for closed-source modifications/forks, or if statically linking ZeroFS crate into a larger proprietary Rust program, is that roughly it?
  wgjordan 3 days ago
  Also worth noting (as a sibling comment pointed out) that despite these assurances the untested legal risks of AGPL-licensed code may still cause difficulties for larger, risk-averse companies. Google notably has a blanket policy [1] banning all AGPL code entirely as "the risks outweigh the benefits", so large organizations are probably another area where the commercial license comes into play.
  [1] https://opensource.google/documentation/reference/using/agpl...
  Eikon 3 days ago
  > so the commercial license would only be needed for closed-source modifications/forks
  Indeed.
  andydang 3 days ago
  [dead]
- undefined 3 days ago
  [deleted]
- maxmcd 3 days ago
  does having to maintain the slatedb as a consistent singleton (even with write fencing) make this as operationally tricky as a third party db?
  Eikon 3 days ago
  It’s not great UX on that angle. I am currently working on coordination (through s3, not node to node communication), so that you can just spawn instances without thinking about it.
- corv 3 days ago
  Looks like the underdog beats it handily and easier deployment to boot. What's the catch?
  aeblyve 3 days ago
  ZeroFS is a single-writer architecture and therefore has overall bandwidth limited by the box it's running on.
  JuiceFS scales out horizontally as each individual client writes/reads directly to/from S3, as long as the metadata engine keeps up it has essentially unlimited bandwidth across many compute nodes.
  But as the benchmark shows, it is fiddly especially for workloads with many small files and is pretty wasteful in terms of S3 operations, which for the largest workloads has meaningful cost.
  I think both have their place at the moment. But the space of "advanced S3-backed filesystems" is... advancing these days.
  undefined 3 days ago
  [deleted]
- undefined 3 days ago
  [deleted]
- victorbjorklund 3 days ago
  Can SQLite run on it?
eru 3 days ago
Distributed filesystem and POSIX don't go together well.