Comments Page - Show HN: EloqKV – Scalable distributed ACID key-value database with Redis API

Show HN: EloqKV – Scalable distributed ACID key-value database with Redis API (eloqdata.com)
Submitted by hubertzhang 2 months ago

apavlo 2 months ago
> EloqKV brings significant innovations to database design
What is the novel part? I read your "Introduction to Data Substrate" blog article and the architecture you are describing sounds like NuoDB from the early 2010s. The only difference is that NuoDB scales out the in-memory cache by adding more of what they call "Transaction Engine" nodes whereas you are scaling up the "TxMap" node?
See also Viktor Leis' CIDR 2023 paper with the Great Phil Bernstein:
* https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf
* https://youtu.be/tiMvcqIfWyA
- eloqdata 2 months ago
  Thank you, Andy! This is Jeff, CEO and Chief Architect of EloqData. It's a great honor for us to have THE Andy Pavlo join the discussion on our first HN submission.
  If I remember correctly, NuoDB uses a shared cache with a cache coherence protocol, whereas EloqKV uses a shared-nothing (partitioned) cache. The former gives local reads but needs to broadcast each write to all nodes. The latter has no broadcast for writes, but a read may be remote. The tradeoff is evident and we are actively exploring opportunities to strike a balance, e.g., using the shared cache mode for frequently read, rarely written data items.
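A rough way to see the tradeoff described above is to count network messages under each design. The sketch below is a toy cost model, not a description of NuoDB or EloqKV internals: it assumes uniformly random key access, one message per remote operation, and one message per peer for a broadcast write. The names `shared_cache_cost` and `partitioned_cache_cost` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheCost:
    """Network messages needed for a workload under one cache design (toy model)."""
    read_msgs: int
    write_msgs: int

    @property
    def total(self) -> int:
        return self.read_msgs + self.write_msgs


def shared_cache_cost(nodes: int, reads: int, writes: int) -> CacheCost:
    # Assumption: every node caches every item, so reads are local,
    # and each write is broadcast to the other (nodes - 1) peers.
    return CacheCost(read_msgs=0, write_msgs=writes * (nodes - 1))


def partitioned_cache_cost(nodes: int, reads: int, writes: int) -> CacheCost:
    # Assumption: each key lives on exactly one node and access is uniform,
    # so a fraction (nodes - 1) / nodes of operations is remote (one message each).
    def remote(ops: int) -> int:
        return ops * (nodes - 1) // nodes

    return CacheCost(read_msgs=remote(reads), write_msgs=remote(writes))


if __name__ == "__main__":
    for reads, writes in [(1000, 10), (10, 1000)]:
        shared = shared_cache_cost(4, reads, writes).total
        partitioned = partitioned_cache_cost(4, reads, writes).total
        print(f"reads={reads} writes={writes} shared={shared} partitioned={partitioned}")
```

Under these assumptions the shared cache wins for read-heavy workloads and the partitioned cache wins for write-heavy ones, which is the balance the comment refers to.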
  We appreciate you pointing us to the CIDR paper. I had the pleasure of working with Phil for some time and fondly remember many discussions with Phil on various topics many years ago. To address your question, yes, we've been trying to solve the research challenges presented in the CIDR paper. The devil is in the details. We've developed numerous new algorithms and invested significant engineering effort into the design and implementation of our products. The benefits are as follows:
  - Optimality: We believe we have an overall design that optimizes synchronous disk writes and network round-trips. For instance, when the design is reduced to a single node, its performance matches or exceeds that of single-node servers. As you might expect, a lot of innovation has gone into making distributed transactions as efficient as non-distributed ones, comparable to those in MySQL or PostgreSQL.
  - Modularity: Our architecture allows us to easily replace the Parser/Compute layer and Storage/Persistence layer with the best existing solutions. This means we can create new databases by leveraging existing parsers and compute engines from current database implementations to achieve API-compatibility, as well as leveraging existing high-performance KV stores for the persistence layer. This allows us to avoid reinventing the wheel and to take advantage of decades of innovation in the database community.
  - Scalability: The entire system operates without a single synchronization point, not even a global sequencer. We drew a lot of inspiration from the Hekaton paper and your TicToc paper. All four types of resources (CPU, Memory, Storage, Logging) can be scaled independently, as we mentioned earlier. More importantly, they can scale dynamically to accommodate workload changes without service disruptions.
  We look forward to sharing more technical details as we move out of stealth mode. I hope to continue this conversation with you in person in the near future.
  dbms1 2 months ago
  That is really interesting. It indeed reminds me a bit of this research project: https://github.com/DataManagementLab/ScaleStore (the one Andy posted the analysis of)
fizx 2 months ago
Redis's transaction api is terrible, and doesn't work across shards. Any reasonable transactions are done in Lua, and because Lua mostly works well, there's not a lot of pressure to fix transactions.
However, if you're giving redis access to different tenants, Lua is too dangerous.
I'd love to see a "real" transaction API for Redis.
- eloqdata 2 months ago
  Thank you for the insightful reply. I agree, Redis's transaction API ("MULTI/EXEC/DISCARD/WATCH") is challenging compared to SQL's familiar "START/COMMIT/ROLLBACK," and it has key limitations:
  1. No rollback on failure: If a command in EXEC fails, Redis can't roll back the transaction. EloqKV resolves this by starting real transactions on the TxServer node during EXEC. If any command fails, the changes are rolled back, maintaining atomicity. Additionally, EloqKV allows specifying isolation levels when using Redis transactions.
  2. Lack of interactivity: This is where Lua shines. Users can embed business logic in Lua, functioning similarly to stored procedures in SQL.
  3. No cross-shard transactions: Redis clusters can't redirect requests across shards, forcing clients to manage topology awareness. EloqKV addresses this with a fully distributed transactional design, eliminating these cross-shard complexities. For more on this, see our blog.
  https://www.eloqdata.com/blog/2024/08/22/benchmark-cluster#w...
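The first limitation in the list above can be illustrated with a toy in-memory store. This is a sketch only: `exec_no_rollback` mimics the documented Redis behavior where a command that fails at runtime inside EXEC does not undo the others, while `exec_with_rollback` is a generic copy-then-install approach chosen for illustration, not EloqKV's actual mechanism. All function names here are hypothetical.

```python
from typing import Callable, Dict, List

Store = Dict[str, str]
Command = Callable[[Store], None]


def set_cmd(key: str, value: str) -> Command:
    def run(store: Store) -> None:
        store[key] = value
    return run


def incr_cmd(key: str) -> Command:
    def run(store: Store) -> None:
        # int() raises ValueError when the stored value is not an integer.
        store[key] = str(int(store.get(key, "0")) + 1)
    return run


def exec_no_rollback(store: Store, commands: List[Command]) -> List[str]:
    """Run every queued command; a failing command does not undo the others."""
    results = []
    for cmd in commands:
        try:
            cmd(store)
            results.append("OK")
        except ValueError as exc:
            results.append(f"ERR {exc}")
    return results


def exec_with_rollback(store: Store, commands: List[Command]) -> bool:
    """Apply commands to a private copy; install it only if all of them succeed."""
    working = dict(store)
    try:
        for cmd in commands:
            cmd(working)
    except ValueError:
        return False  # the original store is left untouched
    store.clear()
    store.update(working)
    return True


if __name__ == "__main__":
    queued = [set_cmd("a", "1"), incr_cmd("name"), set_cmd("b", "2")]
    first, second = {"name": "alice"}, {"name": "alice"}
    print(exec_no_rollback(first, queued), first)
    print(exec_with_rollback(second, queued), second)
```

With the same queued commands, the first store ends up partially updated while the second is unchanged.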
  As for your point on Lua's risks in multi-tenant environments, I completely agree. Lua lacks robust ACLs or resource limitations, and managing SHA keys is cumbersome. We're considering enhancements to address these issues in the future. Finally, we are indeed working on a "real" transaction API for Redis. Like SQL, users will be able to begin transactions, read keys, apply transformations, and generate new keys, with the ability to commit or abort. EloqKV will also support configurable isolation levels and transaction protocols (OCC/Locking).
  Do you have any additional API preferences beyond "START/COMMIT/ROLLBACK" for Redis transactions?
  namibj 2 months ago
  A way to tie transaction success to another system's transaction success is a powerful ability to have. Think Kafka watermarks and just general 2-phase commit transaction systems.
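For readers unfamiliar with the idea, a bare-bones two-phase commit looks roughly like the sketch below. It is a textbook outline with hypothetical names (`Participant`, `two_phase_commit`) and it omits what makes real 2PC hard: durable logging of the coordinator's decision, timeouts, and crash recovery.

```python
from typing import List


class Participant:
    """One system taking part in the transaction (e.g. a KV store or a queue)."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote. A real participant would persist its vote first.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: every participant must vote yes.
    for participant in participants:
        if not participant.prepare():
            # Any "no" vote aborts the transaction everywhere.
            for other in participants:
                other.abort()
            return False
    # Phase 2: all voted yes, so the decision is commit.
    for participant in participants:
        participant.commit()
    return True


if __name__ == "__main__":
    kv, queue = Participant("kv"), Participant("queue", can_commit=False)
    print(two_phase_commit([kv, queue]), kv.state, queue.state)
```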
  fizx 2 months ago
  My dream here is probably webassembly transactions with bounded resource consumption
- netpaladinx 2 months ago
  Talking about transactions in Redis, one area that comes to mind is metadata in file systems. I've seen colleagues/collaborators run large-scale training on a distributed file system with a billion files, which puts a lot of pressure on the metadata part. They tried a few options and Redis was one of them. It's fast and Lua is good enough to support metadata ops. But the thing is that it cannot scale (or Lua is gone) and may lose data from time to time, which is annoying. It looks like this durable-transactional combination may fit in. Will wait to see how this unfolds.
the_precipitate 2 months ago
This is really interesting! The performance looks impressive. I’ve always struggled with the Redis/MySQL two-tier architecture because they are two completely different systems. Porting an application is always tricky due to API incompatibility, and maintaining consistency between the two is a huge hassle. If there’s any issue on the cache side (like performance jitters or a few servers going down), the DBMS part can quickly get overwhelmed. This kind of cascading failure is a common problem in distributed systems. There are some great discussions on this topic over at https://danluu.com/cache-incidents/. BTW, although the formatting is a bit rough, it's a great read for anyone interested in caching. Back to the topic: I always hoped that new advancements in database systems could alleviate or eliminate this problem, and your work seems to be on the right track.
- hubertzhang 2 months ago
  Thank you for your interest. I agree—if Redis (as a cache) and MySQL (as a store) are deployed separately, we can only achieve eventual consistency. That’s precisely why we’re integrating cache and storage into a single system, much like EloqKV.
jacobn 2 months ago
Jepsen test?
- hubertzhang 2 months ago
  Yes, this is a top priority for us, and we aim to pass the Jepsen test before the official GA release.
hodr_super 2 months ago
It's interesting that you have a Cassandra version. I usually think of Cassandra as a standalone database. Why would you layer one database on top of another?
- hubertzhang 2 months ago
  Good question! As explained in the linked article, our core technology, Data Substrate, is modular by design and requires a storage layer to handle the actual data (more precisely, to store the checkpoints). This storage layer can be any data store with a Key-Value interface: it doesn't need to meet the consistency and isolation requirements of a full database.
  We chose RocksDB and Cassandra to showcase two different approaches. RocksDB offers efficient local storage, but if there's a disk hardware failure, the data can be lost. Cassandra, on the other hand, ensures high availability by replicating data, making it resilient to disk failures. We also support cloud storage options like DynamoDB and BigTable in our cloud offerings.
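The "any data store with a Key-Value interface" point can be sketched as a small interface plus a cache that checkpoints into it. This is an illustrative sketch, not EloqKV's API: `CheckpointStore`, `InMemoryStore`, and `CachedTable` are hypothetical names, and the in-memory store stands in for RocksDB, Cassandra, or a cloud KV service.

```python
from typing import Dict, Optional, Protocol, Set


class CheckpointStore(Protocol):
    """The only contract the upper layer needs from storage: plain put/get."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryStore:
    """Stand-in backend; any store exposing put/get would fit the protocol."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class CachedTable:
    """In-memory map that periodically checkpoints dirty entries to a store."""

    def __init__(self, store: CheckpointStore) -> None:
        self._store = store
        self._cache: Dict[str, bytes] = {}
        self._dirty: Set[str] = set()

    def write(self, key: str, value: bytes) -> None:
        self._cache[key] = value
        self._dirty.add(key)

    def read(self, key: str) -> Optional[bytes]:
        if key not in self._cache:
            value = self._store.get(key)  # cache miss: fall back to storage
            if value is None:
                return None
            self._cache[key] = value
        return self._cache[key]

    def checkpoint(self) -> int:
        """Flush dirty entries to the store and return how many were written."""
        flushed = 0
        for key in sorted(self._dirty):
            self._store.put(key, self._cache[key])
            flushed += 1
        self._dirty.clear()
        return flushed
```

Swapping the backend then means providing another class with the same two methods; the table code does not change.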
henning 2 months ago
Show us the Jepsen tests. Hire Aphyr to test your database.
- hubertzhang 2 months ago
  Yes, this is a top priority for us, and we aim to pass the Jepsen test before the official GA release.
hodr_super 2 months ago
It does seem that there are a lot of activities to replace/enhance/improve Redis lately. Dragonfly, Garnet from Microsoft, and the BSD licensed Redis fork Valkey, to name just a few. Now this.
- hubertzhang 2 months ago
  Yeah, it's just a happy coincidence that our product release lined up with the Redis license change. We think EloqKV is a way more powerful and versatile product than Redis. It can definitely be used as a simple in-memory cache, just like Redis, Valkey, or Dragonfly. But the real strength of EloqKV is in its full database capabilities: scalability, consistency, and high availability, all the things you'd expect from a modern database.
  hodr_super 2 months ago
  I took a quick look at the evaluation in the blog section of your site, and the numbers look amazing, a little too good to be true, to be honest. It must have taken quite some effort to build this. Who are you guys?
  hubertzhang 2 months ago
  By the way, I'm Hubert, the CTO of EloqData. I'm the one who submitted the post, and I'm happy to answer any questions you have.
PeterZaitsev 2 months ago
What is the license? I see a download, but it is not clear whether it is open source.
- hubertzhang 2 months ago
  We are not open sourcing the code on our GitHub site yet. The current release is a prebuilt binary, and users can do whatever they want with the software (there is a license file in the downloaded tarball). EloqKV is not yet production ready, so please don't use it in a production environment.
  We are a small team of developers, and EloqKV is still under heavy development. We would like to stay agile and iterate quickly, improving the product in line with our customers' feedback. However, we are actively evaluating multiple paths to open sourcing our technology, and we will keep any suggestions and concerns in mind as we progress.
  PeterZaitsev 2 months ago
  Hi,
  Having an open source project on GitHub does not mean you need to become less agile. First, there will not instantly be thousands of people sending you pull requests; second, even if there are, you have no obligation to accept them.
  However, if you say, in effect, "we can write crap code and no one will know because they only see binaries", that is not really a way to build confidence.
  In my opinion, these days just having a "binary you can use" is the worst way to go for something like a database or data store. If you are not sure about your open source plans, something like a SaaS solution with a free tier, so users can experiment easily, could be the way to go.
  hubertzhang 2 months ago
  Thank you for the suggestions. We are currently developing a cloud-native version of EloqKV, with both Serverless and BYOC (Bring Your Own Cloud) options underway. The binary is a technology preview for those interested in experiencing the new architecture. EloqKV is built with a cloud-native focus, featuring a quaternary decoupling architecture that enables independent scaling of CPU, memory, storage, and logging. This design maximizes cloud elasticity for optimal scalability and resource efficiency. So I completely agree with the suggestion to offer a SaaS solution, allowing users to experiment.
  gregwebs 2 months ago
  AGPL is open source but also seems to defend against large companies and cloud providers using the software without paying for a license. With a contributor agreement in place, you can sell them a different license.
  DB sales of open source are based on cloud and security. You can keep some security as open core along with cloud deployment.
  Open source is about expanding the total potential market even if you only capture a portion of the value there.
- the_precipitate 2 months ago
  Playing devil’s advocate here, I do have some doubts about the future dynamics between open source and database systems. Andy Pavlo mentioned in his CMU DB course that nowadays new database companies are either fully closed-source or only open a small portion of non-essential code.
  If you think about it, databases are arguably a more critical piece of IT infrastructure than cybersecurity products. Yet, while cybersecurity companies are thriving with many company valuations exceeding $10 billion, database companies—especially those embracing open source—struggle to achieve similar commercial success. The few database companies that do reach high valuations are typically based on closed-source products, while those adopting open-source models often face significant challenges in becoming profitable.
  It's also worth noting that many companies in the database space have recently changed their licenses around open source or source availability. If major players like MongoDB, ElasticSearch, and Redis are all making these tough decisions to build sustainable businesses, it might not be fair to simply blame corporate greed. Without adequate returns on investment, venture funding for database development could dry up, which would ultimately stifle innovation in this crucial sector.
- hodr_super 2 months ago
  If it's not open-sourced, I’m not going to use it to store my critical data. Just saying...
  ljchen 2 months ago
  I second this. It's too risky and simply not worth it. Nevertheless, I find this "optional ACID" thing interesting. Many years ago when I was a graduate student, NoSQL was a big thing. It was widely claimed that transactions were expensive and you had to drop them in order to scale. I always had this question: if transactions were the culprit, why not turn them off? I later found that the relational system is such a monolith that everything (caching, concurrency control, logging, locking) is wired together in an extremely complex way and there is simply no "turning off".
  throwaway81523 2 months ago
  Redis simply serializes every operation, I thought. Transaction = run a Lua script as a single operation. I think that is ACID, if you count RAM as "durable" and doing one thing at a time as "concurrent".
  hodr_super 2 months ago
  Yeah, I think they were talking about distributed transactions. Redis only supports transactions in a single instance, not in a cluster. You cannot run Lua across machines.
  throwaway81523 2 months ago
  https://yourdatafitsinram.net h/t daemonology ;)
hubertzhang 2 months ago
Note that this binary release is a preview of EloqKV. We're actively developing a cloud-native version with seamless scalability and a serverless experience. Stay tuned—it's coming soon!
gregwebs 2 months ago
Is the WAL 1:1 with a TxMap instance? Is the TxMap 1:1 with a node? For a distributed transaction with distributed storage and multiple nodes how does a transaction get coordinated?
- hubertzhang 2 months ago
  Good question! WAL and TxService(TxMap) are fully decoupled, allowing you to deploy them on the same node or across different machines. If your workload is write-intensive, you might consider a large WAL cluster, which leverages multiple disks to perform fsync operations in parallel. This is particularly important in cloud environments, where local SSDs are ephemeral and cloud-native databases often use EBS for WAL. However, high-performance EBS options like io2 can be costly. EloqKV's decoupled architecture allows you to scale write throughput up to 600K Write OPS on a single TxServer as you increase the number of cheap gp3 disks. Since WAL logs can be truncated after a checkpoint, the required disk size is typically quite small, making gp3 a cost-effective choice.
  For more details, please check our scaling disks benchmark report.
  https://www.eloqdata.com/blog/2024/08/25/benchmark-txlog#exp...
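The idea of spreading the WAL over several disks so that fsync calls run in parallel can be sketched as follows. This is a minimal illustration, not EloqKV's log service: `ShardedLog` is a hypothetical name, records are routed by `txn_id` modulo the number of files (an assumption), and the files here live in one temporary directory, whereas a real deployment would place each on a separate device.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


class ShardedLog:
    """Append-only log split across several files (stand-ins for separate disks)."""

    def __init__(self, paths: List[str]) -> None:
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        self._fds = [os.open(path, flags, 0o600) for path in paths]
        self._pool = ThreadPoolExecutor(max_workers=len(paths))

    def append(self, txn_id: int, record: bytes) -> int:
        """Write one record to the shard chosen by txn_id; return the shard index."""
        shard = txn_id % len(self._fds)
        os.write(self._fds[shard], record + b"\n")
        return shard

    def sync_all(self) -> None:
        # One fsync per file, issued from separate threads so the calls can
        # overlap when the files sit on different devices.
        list(self._pool.map(os.fsync, self._fds))

    def close(self) -> None:
        self._pool.shutdown()
        for fd in self._fds:
            os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        paths = [os.path.join(directory, f"wal-{i}.log") for i in range(3)]
        log = ShardedLog(paths)
        for txn_id in range(6):
            log.append(txn_id, f"txn {txn_id}".encode())
        log.sync_all()
        log.close()
```

A sharded log like this gives up a single total order of records, so a real system needs some other way to order transactions across shards during recovery.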
  > For a distributed transaction with distributed storage and multiple nodes how does a transaction get coordinated?
  EloqKV is a multi-writer system, similar to many distributed databases (FoundationDB, TiDB, CockroachDB), but we have a set of innovations in the transaction protocols. For example, the entire system operates without a single synchronization point, not even a global sequencer. We drew a lot of inspiration from the Hekaton and TicToc papers.
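To give a feel for how a protocol can avoid a global sequencer, here is a heavily simplified, single-threaded sketch of the commit-timestamp idea from the TicToc paper: each record carries a write timestamp (`wts`) and a read timestamp (`rts`), and a transaction derives its commit timestamp from the records it touched instead of asking a central counter. This is not EloqKV's protocol, and it leaves out the write-set locking that the real algorithm needs under concurrency. The class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    value: str
    wts: int = 0  # timestamp of the write that created this version
    rts: int = 0  # latest timestamp at which this version is known to be valid


@dataclass
class ReadEntry:
    value: str
    wts: int
    rts: int


class Txn:
    def __init__(self, db: Dict[str, Record]) -> None:
        self.db = db
        self.reads: Dict[str, ReadEntry] = {}
        self.writes: Dict[str, str] = {}

    def read(self, key: str) -> str:
        if key in self.writes:
            return self.writes[key]
        rec = self.db[key]
        self.reads[key] = ReadEntry(rec.value, rec.wts, rec.rts)
        return rec.value

    def write(self, key: str, value: str) -> None:
        self.writes[key] = value  # buffered until commit

    def commit(self) -> Optional[int]:
        """Return the commit timestamp, or None if validation fails (abort)."""
        commit_ts = 0
        # Writes must be ordered after every read of the version they replace.
        for key in self.writes:
            if key in self.db:
                commit_ts = max(commit_ts, self.db[key].rts + 1)
        # Reads must be ordered no earlier than the version they observed.
        for entry in self.reads.values():
            commit_ts = max(commit_ts, entry.wts)
        # Validate: each version read must still be valid at commit_ts.
        for key, entry in self.reads.items():
            if entry.rts < commit_ts:
                rec = self.db[key]
                if rec.wts != entry.wts:
                    return None  # someone overwrote what we read
                rec.rts = max(rec.rts, commit_ts)  # extend the valid interval
        for key, value in self.writes.items():
            self.db[key] = Record(value, wts=commit_ts, rts=commit_ts)
        return commit_ts
```

In this sketch, two transactions that each read what the other writes cannot both commit: the second one to validate sees a changed `wts` and aborts, with no shared counter involved.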
AtlasBarfed 2 months ago
Yet another distributed database announcement with nary a mention of CAP tradeoffs. Then again, no sales pitch starts with the limitations. Also, one man's ACID isn't another person's. At least they aren't advertising SQL joins with handwaves to the distributed complexities.
It superficially sounds like a series of server processes fronting actual database servers, which sounds like another layer of partition vulnerability and points of failure. But I also had similar high-level concerns about the complexity of FoundationDB, yet people seem satisfied as to the validity of that architecture.
I fail to see how one would scale underlying resources if the persistence is done by various storage systems, and you'd be subject to limitations of those persistence engines. That sounds like a suspect claim, like the "Data Substrate" is a big distributed cache that scales in front of delayed persistence to an actual database. Again, sounds like oodles of opportunities for failure.
"Data substrate draws inspirations from the canonical design of single-node relational database management systems (RDBMS)." Look, I don't think a good distributed database starts from a CA system and bolts on partition tolerance. I get you can get far with big nodes, fast networks, and sharding but ... I mean, they talk about B+ trees but a foundational aspect of Cassandra and other distributed databases is that B+ trees don't scale past a certain point, because there exist trees so deep/tall that even modern hardware chokes on it, and updating the B+ tree with new data gets harder as it gets bigger.
As others have said, I'll leave it to Aphyr to possibly explain.
- eloqdata 2 months ago
  You're absolutely right—discussing the CAP theorem is essential for any distributed storage system. We are preparing a detailed blog post to cover the implications of CAP in our system. In brief, our system generally aligns with PC/EC in PACELC terms (https://en.wikipedia.org/wiki/PACELC_theorem), similar to other fully distributed databases like CockroachDB and TiDB. However, this configuration is flexible. For instance, in high-performance cache applications where durability is less critical, our system can be adjusted to favor PA/EL, akin to many NoSQL systems.
  Regarding persistence, our Substrate component manages its own logging system while relying on external storage engines for checkpointing. This means that the system depends on the external storage only for performance and capacity; our database does not depend on the external storage engine for consistency. For example, DynamoDB offers virtually unlimited capacity and performance (in terms of QPS) in the cloud, and our Data Substrate is agnostic to whether the data is stored as a B-Tree or an LSM tree.
  We are currently preparing the software for Jepsen testing prior to its General Availability (GA) release. More technical details will be shared as we move forward, so please stay tuned.
- hubertzhang 2 months ago
  > It superficially sounds like a series of server processes fronting actual database servers, which sounds like another layer of partition vulnerability and points of failure
  I agree that introducing a middleware layer on top of databases adds more points of failure. EloqKV avoids this by integrating Compute, Logging, and Storage into a single system. In this setup, storage can be a database like Cassandra, but users will not access it directly; all requests go through EloqKV, which manages ACID transactions. EloqKV is responsible for handling crashes of the TxServer, LogServer, and Storage. You can think of the distributed storage engine as just a disk in a traditional DBMS. Obviously its failure will affect the system, but no more than a hard disk failure would. In fact, in a cloud, all disks (e.g., EBS) are actually distributed storage systems.
  This situation is a rethinking of the Redis and MySQL combination, which suffers from similar issues. Both systems can fail independently, resulting in only eventual consistency. EloqKV aims to resolve this problem.
marcodiego 2 months ago
License?
- hubertzhang 2 months ago
  The current release is a prebuilt binary, giving users flexibility in how they use the software. The open-source model is still being decided, but a license file is included in the downloaded tarball.