• rebanevapustus 2 hours ago

    Big fan of Feldera here.

    I would advise everybody to steer clear of anything that isn't Feldera or Materialize. Nobody aside from these guys has an IVM product that is grounded in proper theory.

    If you are interested in trying out the theory (DBSP) underneath Feldera, but in Python, then check this out: https://github.com/brurucy/pydbsp

    It works with pandas, polars...anything.

    • jamesblonde 38 minutes ago

      It's based on Z-sets - a generalization of relational algebra. Many of the aggregations, projections, and filters from SQL are associative and can be implemented on Z-sets. Z-sets support incremental operations (adding one value to a set while computing the 'max' is just the max of the two arguments, rather than recomputing the 'max' over the entire set).
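A toy Python sketch of the idea (illustrative only, not Feldera's or pydbsp's actual implementation): a Z-set maps elements to integer weights, so inserts and retractions are just additions, and linear operators like filters can be applied directly to deltas.

```python
from collections import defaultdict

def zset(pairs):
    """A Z-set maps each element to an integer weight (multiplicity)."""
    z = defaultdict(int)
    for elem, w in pairs:
        z[elem] += w
    return {e: w for e, w in z.items() if w != 0}

def z_add(a, b):
    """Z-set addition: weights of matching elements sum; zero weights vanish."""
    out = defaultdict(int)
    for z in (a, b):
        for e, w in z.items():
            out[e] += w
    return {e: w for e, w in out.items() if w != 0}

def z_filter(z, pred):
    """Filtering is linear, so applying it to a delta alone is already
    the correct incremental update."""
    return {e: w for e, w in z.items() if pred(e)}

base = zset([(4, 1), (5, 1)])
delta = zset([(5, -1), (3, 1)])   # retract 5, insert 3
print(z_add(base, delta))         # {4: 1, 3: 1}
```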

      • nikhilsimha 22 minutes ago

        dumb question: how do z-sets or feldera deal with updates to values that were incorporated into the max already?

        For example - max over {4, 5} is 5. Now I update the 5 to a 3, so the set becomes {4, 3} with a max of 4. This seems to imply that the z-sets would need to store ALL the values - again, in their internal state.

        Also there needs to be some logic somewhere that says that the data structure for updating values in a max aggregation needs to be a heap. Is that all happening somewhere?
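In toy Python terms, here's what I imagine the state has to look like (a hypothetical sketch, not Feldera's actual data structure): because max isn't a linear aggregate, handling a retraction means keeping every value with its multiplicity around.

```python
from collections import Counter

class IncrementalMax:
    """Toy sketch: max is not linear, so supporting retractions requires
    keeping all values (with multiplicities) in internal state."""
    def __init__(self):
        self.counts = Counter()

    def update(self, value, weight):
        # weight=+1 inserts a value, weight=-1 retracts one occurrence
        self.counts[value] += weight
        if self.counts[value] == 0:
            del self.counts[value]

    def max(self):
        # O(n) here for clarity; an ordered index or heap makes it O(log n)
        return max(self.counts) if self.counts else None

m = IncrementalMax()
m.update(4, +1)
m.update(5, +1)
assert m.max() == 5
m.update(5, -1)   # update the 5...
m.update(3, +1)   # ...to a 3
assert m.max() == 4
```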

      • shuaiboi 11 minutes ago

        This is pretty neat, but I'm wondering how well this implementation obeys dataframe algebra. Ponder goes into detail about how dataframes and relations aren't the same, but your dataframe Z-set seems to be more or less the exact same thing as the relational Z-set?

        https://youtu.be/7TyIjqvfWto?si=CMFH30DFEWxkltlw&t=1095

      • jacques_chester 3 hours ago

        I remember seeing a VMware-internal presentation on the DDlog work which led to Feldera and being absolutely blown away. They took a stream processing problem that had grown to an hours-deep backlog and reduced it to sub second processing times. Lalith & co are the real deal.

        • lsuresh 2 hours ago

          Thank you jacques_chester! Piping all that credit to my co-founders Mihai and Leonid, the key inventors.

        • ZiliangXK 41 minutes ago

          Timeplus Proton OSS (https://github.com/timeplus-io/proton) does a similar thing, but with powerful historical query processing as well.

          • jitl 4 hours ago

            I’ve been following the Feldera/DBSP/Differential Datalog team for a while and am happy to see y’all stable-ish with your own venture and settling in a model more approachable than DDlog for most developers :)

            This seems much more adoptable to me in my org than DDlog was, even if I really liked DDlog much more than SQL :-(

            • lsuresh 4 hours ago

              Thanks for following our journey! There's still room for more language frontends if you'd like to contribute. :)

            • shuaiboi 36 minutes ago

              would something like dbsp support spreadsheet-style computations? Most of the financial world is stuck behind spreadsheets, and the entire process of productionizing spreadsheets is broken:

              * Engineers don't have time to understand the spreadsheet logic and translate everything into an incremental version for production.

              * Analysts don't understand the challenges with stream processing.

              * SQL is still too awkward of a language for finance.

              * Excel is a batch environment, which makes it hard to codify it as a streaming calculation.

              If I understand correctly, your paper implies that as long as there is a way to describe spreadsheets as a Z-set, some incremental version of the program can be derived? Spreadsheets are pretty close to a relational table, but it would be a Z-set algebra on cells, not rows, similar to functional reactive programming. So dbsp on cells would be incremental UDFs, not just UDAFs?
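Roughly what I have in mind, as a toy Python sketch (hypothetical, not based on any actual dbsp or pydbsp API): model the sheet as a Z-set of ((row, col), value) entries, so editing a cell is a retraction plus an insertion, and a linear formula like a column SUM can be maintained from that delta alone.

```python
# Cells as a Z-set: (cell, value) -> weight
cells = {(("A", 1), 4): 1, (("A", 2), 5): 1}

def edit(cells, cell, old, new):
    """An update is retract(old) + insert(new) in Z-set terms."""
    delta = {(cell, old): -1, (cell, new): 1}
    out = dict(cells)
    for k, w in delta.items():
        out[k] = out.get(k, 0) + w
        if out[k] == 0:
            del out[k]
    return out, delta

cells, delta = edit(cells, ("A", 2), 5, 3)

# SUM over the cells is linear, so the delta alone gives the change:
sum_delta = sum(v * w for ((_, _), v), w in delta.items())
assert sum_delta == -2   # 4 + 5 = 9 becomes 4 + 3 = 7
```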

              thoughts??

              • arn3n 5 hours ago

                If you don’t want to change your whole stack, ClickHouse’s Materialized Views do something extraordinarily similar, where computations are run on inserts to the source table in an online/streaming manner. I’m curious how this solution compares in its set of features/guarantees.

                • atombender 36 minutes ago

                  ClickHouse's materialized views are wonderful, but they do require very careful design up front.

                  In particular, aggregations need to be defined using the special AggregateFunction data types, which must be paired with the corresponding aggregation functions such as countMerge(). Joins are possible in CH views, but they operate in a specific way (against the insert batch) that you must know about; joins against other tables are generally a bad idea for performance, and you should use dictionaries as much as possible for fast in-memory lookups. Lastly, it's hard to update MVs because their entire source query has to be modified as a whole. Adding a column requires redeclaring the whole MV, which introduces the possibility of making mistakes in your migrations.

                  CH views are really more like triggers, and so they're a little misleadingly named. Very powerful, of course. In short, a lot more "manual" than this other system.

                  • lsuresh 5 hours ago

                    For incremental computation, Feldera is just way more powerful and general. It can evaluate arbitrarily sophisticated SQL programs incrementally (tables and deeply nested layers of views). For example, it can do rolling aggregates over joins, handle late and out-of-order arrivals, can compute over infinite streams with finite state (via automatic garbage collection), and it's strongly consistent. Clickhouse's materialized views are much simpler and restricted in comparison.

                    That said, we are /not/ ever a replacement for ClickHouse or any other historical warehouse. In fact, we pair best with one of them. Have data flow through or be teed into Feldera to maintain sophisticated standing queries -- and maintain historical data in your warehouse.

                  • qazxcvbnm 4 hours ago

                    Incredible… I hadn’t even noticed, and people found the holy grail and open-sourced it!

                    By the way, I was wondering about a related question. Do streaming engines typically store a copy of the data streamed to them? For instance, if I had a view to get the maximum value of a table, and the maximum value was removed, the streaming engine surely needs to get the next value from somewhere. It seems clear that the streaming engine needs at least its own snapshot of the data to have a consistent state of the computation, but duplicating the persisted data seems somewhat wasteful.

                    • lsuresh 4 hours ago

                      Thank you!

                      The state Feldera maintains depends on the queries you write and the working set or windows you're computing over. Any time there are joins, distinct, or non-linear aggregations, we need to maintain state, as you've guessed.

                      A cool feature in Feldera is that it can compute over infinite streams with finite state because we automate garbage collection. The user specifies lateness over data sources or even views, and with some static analysis, Feldera determines when it is safe to forget old state such that it won't affect the output of any views.
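Here's the intuition as a toy Python sketch (greatly simplified, hypothetical names, not our actual implementation or API): once the watermark passes a bucket's timestamp plus the declared lateness, that bucket can never be updated again, so its state can be dropped without affecting any view output.

```python
LATENESS = 60  # seconds; hypothetical lateness bound for this sketch

class WindowedCount:
    """Toy per-timestamp event count with lateness-based garbage collection."""
    def __init__(self):
        self.state = {}     # timestamp bucket -> count
        self.watermark = 0

    def insert(self, ts):
        if ts < self.watermark - LATENESS:
            return          # too late: ignored, per the lateness contract
        self.state[ts] = self.state.get(ts, 0) + 1
        self.watermark = max(self.watermark, ts)
        # GC: forget buckets that can no longer receive updates
        cutoff = self.watermark - LATENESS
        self.state = {t: c for t, c in self.state.items() if t >= cutoff}

w = WindowedCount()
for ts in (10, 20, 95, 100):
    w.insert(ts)
assert 10 not in w.state    # dropped once the watermark passed 10 + LATENESS
assert w.state[100] == 1
```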

                    • bbminner 4 hours ago

                      I wonder what guarantees can be made wrt resource consumption. I suppose it'd be reasonable to assume that in most (all?) cases an update is cheaper than a recompute in terms of CPU cycles, but what about RAM? Intuitively it seems like there must be cases that would force you to store an unbounded amount of data indefinitely in RAM.

                      • ben_pfaff 4 hours ago

                        (Feldera co-founder here.) There are some cases where Feldera needs to index data indefinitely, yes. For those cases, Feldera can put those indexes on storage rather than keeping them entirely in RAM.

                        In a lot of cases where one might initially think that data needs to stay around indefinitely, people actually want the results from the last hour or day or month, etc. For those cases, Feldera supports a concept called "lateness" that allows it to drop older data: https://docs.feldera.com/sql/streaming/#lateness-expressions.

                        • lsuresh 4 hours ago

                          Your intuition is correct. Incremental computation is fundamentally a time-space tradeoff. Depending on the views you write, you might end up maintaining large amounts of state. We've written about it here: https://www.feldera.com/blog/streaming-needs-storage

                          That said, Feldera has several features to keep state bounded even when computing on infinite streams. For example, we do automatic garbage collection (GC) where with some static analysis, we can figure out when it is safe to forget inputs that will no longer affect the output of views.

                          We recently ported a community member's warehouse workload to Feldera, where with these features we were evaluating As-Of joins and streaming aggregations in 1.2GB of RAM on a laptop, at more than a million events/sec.

                        • cube2222 9 hours ago

                          This looks extremely cool. This is basically incremental view maintenance in databases, a problem that almost everybody (I think) has when using SQL databases and wanting to maintain derived views for more performant access patterns. Importantly, they seem to support a wide breadth of SQL operators, support spilling computation state to disk, and it's open-source! Interestingly, it compiles queries to Rust, an approach similar to Redshift (which compiles queries to C++ programs).
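To illustrate the core idea for anyone unfamiliar, here's IVM in miniature (a toy Python sketch of my own, nothing to do with Feldera's actual machinery): a GROUP BY count view maintained from row deltas instead of being recomputed from the base table.

```python
from collections import defaultdict

# The maintained view: key -> count, as in SELECT k, count(*) ... GROUP BY k
view = defaultdict(int)

def apply_delta(key, weight):
    """weight=+1 for an inserted row, -1 for a deleted row."""
    view[key] += weight
    if view[key] == 0:
        del view[key]

# Each base-table change updates the view in O(1), no recomputation:
for key, w in [("a", +1), ("a", +1), ("b", +1), ("a", -1)]:
    apply_delta(key, w)
print(dict(view))   # {'a': 1, 'b': 1}
```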

                          There's already a bunch of tools in this area:

                          1. Materialize[0], which afaik is more big-data oriented, and doesn't pipe the results back to your database, instead storing results in S3 and serving them.

                          2. Epsio[1], which I've never used, seems to be very similar to this product, but is closed-source only.

                          3. When building OctoSQL[2], this capability was also important to me and it was designed from ground up to support it. Though in practice in a tool like OctoSQL it's pretty useless (was a fun problem to solve though).

                          There's some things I'm curious about:

                          - Does it handle queries that involve complex combinations of ordering with limits in subqueries? If, due to a change in an underlying table, a top-n row is added, moving other rows around (and evicting the current n'th), will the subsequent query parts behave as though the order was maintained when computing them, or will it fall apart (imagine a select with a limit from a select with a bigger limit)?

                          - Is it internally consistent[3]? They say it's "strongly consistent" and "It also guarantees that the state of the views always corresponds to what you'd get if you ran the queries in a batch system for the same input." so I think the answer is yes, but this one's really important.

                          Either way, will have to play with this, and dig into the paper (the link in the repo doesn't work, here's an arXiv link[4]). Wishing the creators good luck, this looks great!

                          [0]: https://materialize.com

                          [1]: https://www.epsio.io

                          [2]: https://github.com/cube2222/octosql

                          [3]: https://www.scattered-thoughts.net/writing/internal-consiste...

                          [4]: https://arxiv.org/pdf/2203.16684

                          • jonmoore 6 hours ago

                            The VLDB paper mentioned is https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf.

                            Abstract:

                            "Incremental view maintenance has been for a long time a central problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra, or Datalog. These techniques do not naturally generalize to richer languages. In this paper we give a general solution to this problem in 3 steps: (1) we describe a simple but expressive language called DBSP for describing computations over data streams; (2) we give a general algorithm for solving the incremental view maintenance problem for arbitrary DBSP programs, and (3) we show how to model many rich database query languages (including the full relational queries, grouping and aggregation, monotonic and non-monotonic recursion, and streaming aggregation) using DBSP. As a consequence, we obtain efficient incremental view maintenance techniques for all these rich languages."

                            • lsuresh 5 hours ago

                              Thanks for the kind words! (Feldera's CEO here)

                              - We evaluate top-k queries incrementally and the nesting shouldn't be a problem for the engine (or it'd be a bug). If you have an example of a query, we can try it out at our end.

                              - Yes. It is internally consistent. We've verified with the experiment here: https://www.scattered-thoughts.net/writing/internal-consiste....

                              Our guarantee is that we always produce the same answer as if you'd run the queries in a batch system. All views update together. You can see the computation model here: https://www.feldera.com/blog/synchronous-streaming/

                              And thanks for the catch about the broken paper link. This is the published version: https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf

                              • cube2222 4 hours ago

                                Thanks for the response and clarifications!

                                I think this scenario would illustrate it.

                                Make a table with one column, x, and insert into it rows with values 1-5, and then 8-20.

                                Then query it using more or less `SELECT x FROM (SELECT x FROM xs ORDER BY x LIMIT 15) LIMIT 10`, and then insert 6 into the table. The output should be 1-6, 8-11. Of course, if the limits are merged together during optimisation, that would make the test case moot.
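For reference, a quick batch simulation of the scenario in Python (this is what an internally consistent incremental engine should match after the insert):

```python
# Base table contents after the insert of 6
xs = list(range(1, 6)) + list(range(8, 21))   # 1-5 and 8-20
xs.append(6)

inner = sorted(xs)[:15]   # SELECT x FROM xs ORDER BY x LIMIT 15
outer = inner[:10]        # outer LIMIT 10
print(outer)              # [1, 2, 3, 4, 5, 6, 8, 9, 10, 11]
```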

                                Good luck with your product!

                                • lsuresh 4 hours ago

                                  Thanks! Looks like that works.

                                  Here is the query I set up on try.feldera.com.

                                    CREATE TABLE foo (x INTEGER NOT NULL PRIMARY KEY) WITH ('materialized' = 'true') ;
                                  
                                    CREATE MATERIALIZED VIEW bar AS SELECT x FROM (SELECT x FROM foo ORDER BY x LIMIT 15) LIMIT 10;
                                  
                                  I then used our CLI tool fda to insert some rows and inspect the states after starting the pipeline: https://docs.feldera.com/reference/cli

                                    try.feldera.com/foo> select * from foo;
                                  
                                    +----+
                                    | x  |
                                    +----+
                                    | 1  |
                                    | 2  |
                                    | 3  |
                                    | 4  |
                                    | 5  |
                                    | 8  |
                                    | 9  |
                                    | 10 |
                                    | 11 |
                                    | 12 |
                                    | 13 |
                                    | 14 |
                                    | 15 |
                                    | 16 |
                                    | 17 |
                                    | 18 |
                                    | 19 |
                                    | 20 |
                                    +----+
                                  
                                    try.feldera.com/foo> insert into foo values (6);
                                  
                                    +-------+
                                    | count |
                                    +-------+
                                    | 1     |
                                    +-------+
                                  
                                    try.feldera.com/foo> select * from bar;
                                  
                                    +----+
                                    | x  |
                                    +----+
                                    | 1  |
                                    | 2  |
                                    | 3  |
                                    | 4  |
                                    | 5  |
                                    | 6  |
                                    | 8  |
                                    | 9  |
                                    | 10 |
                                    | 11 |
                                    +----+
                                  • cube2222 3 hours ago

                                    Awesome, thanks for double-checking!

                              • tveita 6 hours ago

                                I think Rama [1] (by Nathan Marz, who was behind Apache Storm) is interesting as a "NoSQL" solution for a similar problem space, as I understand it. Impressive if this can support similar scale using only SQL.

                                [1] https://redplanetlabs.com/

                                • emmanueloga_ 6 hours ago

                                  Also RisingWave:

                                  https://risingwave.com/

                                • seungwoolee518 10 hours ago

                                  When I first saw the title, I thought that "one of the OSes, with the l removed" had introduced a new incremental compute engine?

                                  Anyway, it was very impressive.

                                  • lsuresh 5 hours ago

                                    Thanks!

                                  • Nelkins 6 hours ago

                                    I would love it if something like this exposed C bindings so that every language with an FFI could use the library. I’d love to be able to define pipelines and queries in .NET instead of having to use SQL.

                                    • lsuresh 5 hours ago

                                      Hi Nelkins. We do have a Rust crate you could consider using directly: https://docs.rs/dbsp/latest/dbsp/. Our SQL compiler puts together a pipeline by generating a Rust program that uses this crate.

                                      • loxias 4 hours ago

                                        Second the desire for C bindings! (or someone showing how to wrap and call the rust bindings?)

                                        • lsuresh 3 hours ago

                                          The previous implementation we built at VMware went from datalog -> Rust, and we supported other language bindings using C bindings and FFI. The same ought to work here too.

                                      • faangguyindia 2 hours ago

                                        We just use BigQuery and call it a day.

                                        BigQuery figured this out long ago and built it on top of Bigtable.

                                        • jonstewart 6 hours ago

                                          How does it compare to Materialize/differential dataflow?

                                          • lsuresh 5 hours ago

                                            (Feldera's CEO here)

                                            We are based on DBSP (https://www.vldb.org/pvldb/vol16/p1601-budiu.pdf) which is an evolution of DD. DBSP gives us an algorithm to take arbitrarily complex queries and generate an incremental version of it.

                                            As a consequence, we evaluate everything incrementally. For example, we are the only engine that can perform rolling aggregates incrementally. In general, with DBSP, we can incrementalize "the full relational algebra, queries over sets and multisets, arbitrarily nested relations, aggregation, flatmap (unnest), monotonic and non-monotonic recursion, streaming aggregation, and arbitrary compositions of all of these". DBSP is a much simpler and cleaner foundation.
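In toy Python, the DBSP recipe looks roughly like this (a heavily simplified sketch of the D ∘ Q ∘ I construction from the paper, not the actual engine; the real compiler applies optimizations so that e.g. linear operators simply pass deltas straight through):

```python
from collections import defaultdict

def integrate(deltas):
    """I: turn a stream of Z-set deltas into a stream of snapshots."""
    acc = defaultdict(int)
    for d in deltas:
        for e, w in d.items():
            acc[e] += w
        yield {e: w for e, w in acc.items() if w != 0}

def differentiate(snapshots):
    """D: turn a stream of snapshots back into a stream of deltas."""
    prev = {}
    for s in snapshots:
        yield {e: s.get(e, 0) - prev.get(e, 0)
               for e in set(s) | set(prev)
               if s.get(e, 0) != prev.get(e, 0)}
        prev = s

def incremental(q, deltas):
    """The incremental version of query q: differentiate(q(integrate(...)))."""
    return differentiate(q(s) for s in integrate(deltas))

# q = filter to even numbers, applied incrementally to a delta stream
q = lambda z: {e: w for e, w in z.items() if e % 2 == 0}
deltas = [{1: 1, 2: 1}, {2: -1, 4: 1}]
out = list(incremental(q, deltas))
assert out == [{2: 1}, {2: -1, 4: 1}]
```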

                                            As a product, both our enterprise and open-source (MIT licensed) offerings let you run it anywhere you want including your laptop.

                                             Positioning-wise, we are a query engine with a unified way to compute over both bounded and unbounded data sources with perfect consistency, and an integrated storage layer. Materialize is going for building an operational warehouse.

                                            • ZiliangXK 39 minutes ago

                                              Isn't there always a tipping point, for quite a few use cases, where incremental evaluation degrades perf compared with full batch / historical queries?

                                              • jonstewart 18 minutes ago

                                                Thank you! I will look into it further.