• the_mitsuhiko 5 hours ago

    > One of the unique attributes of the (in-progress) Vortex file format is that it encodes the physical layout of the data within the file's footer. This allows the file format to be effectively self-describing and to evolve without breaking changes to the file format specification.

    That is quite interesting. One challenge in general with Parquet and Arrow in the otel / observability ecosystem is that the shape of span data is not known up front: spans carry arbitrary attributes, and those attributes can change over time. To the best of my knowledge no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.

    • sa46 6 minutes ago

      Parquet also encodes the physical layout using footers [1], as does ORC [2]. Perhaps the author meant support for semi-structured data, like the spans you mention.

      [1]: https://parquet.apache.org/docs/file-format/

      [2]: https://orc.apache.org/specification/ORCv2/#file-tail

      • robert3005 2 hours ago

        The thing we are trying to achieve is the ability to experiment with and tune the way data is grouped on disk. Parquet has one way of laying data out, ORC has another, Lance has yet another (CSV is another still, though as a text format that's a bit moot). A Vortex file itself stores how it is physically laid out on disk, so you can tune and tweak physical layouts to match the specific storage needs of your system (this is the toolkit part, where you can take Vortex and use it to implement your own file format). Having said that, we will also ship an implementation of the file format that follows a particular layout.
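
        To make that concrete, here is a toy sketch of what a self-describing footer might record. Every name below is made up for illustration; these are not actual Vortex structures:

            // Hypothetical sketch of a footer that records physical layout.
            // None of these names come from the Vortex codebase.

            /// How one column's bytes are arranged and encoded on disk.
            struct ColumnLayout {
                name: String,
                /// Identifier of the codec used (e.g. "rle" or "dict").
                encoding: String,
                /// Byte ranges of this column's chunks: (offset, length).
                chunks: Vec<(u64, u64)>,
            }

            /// A footer that fully describes the file's physical layout,
            /// so a reader needs no out-of-band spec to locate the data.
            struct Footer {
                columns: Vec<ColumnLayout>,
                /// Written last, so a reader can find the footer by
                /// seeking to a known position at the end of the file.
                footer_offset: u64,
            }

        Because the layout description travels inside the file, two files can group and encode the same logical data completely differently and still be read by the same reader.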

        • gigatexal 5 hours ago

          As someone who works in data, schema-on-read formats like Parquet are amazing. I hate having to guess schemas with CSVs.

          • physicsguy 3 hours ago

            Pandera is quite nice for at least forcing validation in Pandas for this

          • jnordwick 2 hours ago

            If it's in the footer, then appending to the columns is out of the question, it seems, without moving the footer.

            • cle 2 hours ago

              Isn't this what the Arrow IPC File format does too? Is there something unique about this?

              • _willmanning 2 hours ago

                Compression! Vortex can easily be 10x smaller than the equivalent Arrow representation (and decompresses very quickly into Arrow)

            • ericyd 5 hours ago

              Thank God this file format is written in Rust, otherwise I'd be extremely skeptical.

              • smartmic 3 hours ago

                It's funny how "written in Rust" has become a running gag here on HN - but only if it's already mentioned in the title…

                • keybored an hour ago

                  Is this a pun or something?

                  • ericyd an hour ago

                    I was being sarcastic, yes. Also the title used to include "written in Rust"

                  • neeh0 5 hours ago

                    It gave me a moment of pause wondering why Rust is part of the equation, but I concluded I'm too dumb

                    • aduffy 6 minutes ago

                      Buried under the memes/vibes there is an actual reason this is important for data tools.

                      The previous generation of analytics/"Big Data" projects (think Hadoop, Spark, Kafka, Elastic) were all built on the JVM: monolithic distributed clusters hosted on VMs or on-premise, servers with clients implemented in Java. It is effectively impossible to embed a Java library into anything non-Java; the best you can do is fork a JVM with a carefully maintained classpath and hit it over the network (cf. PySpark). Kafka, for example, has externally maintained bindings that lag the official JVM client.

                      Parquet was built during this era, so naturally its reference implementation was written in Java. For many years, the only implementation of Parquet was in Java. Even when parquet-cpp and subsequent implementations began to pop up, the Parquet Java implementation was still the best maintained. Over time as the spec got updated and new features made their way into Parquet, different implementations had different support. Files written by parquet-cpp or parquet-rs could not be opened via Spark or Presto.

                      The newer generation of data analytics tooling is meant to be easily embedded, which generally means a native language that can export shared objects with a C ABI, consumable by the FFI layer of different languages. That leaves you a few options, and of those Rust is arguably the best for reasons of tooling and ecosystem, though different projects make different choices: DuckDB, for example, is an extremely popular library with bindings in several languages, and it was built in C++ long after Rust came into vogue.
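
                      As a toy illustration of that embedding story (generic Rust FFI, nothing Vortex-specific), exporting a C ABI from Rust is about this involved:

                          // Build with crate-type = ["cdylib"] in Cargo.toml; any
                          // language with a C FFI (Python ctypes, Go cgo, etc.)
                          // can then load the resulting shared object.

                          /// Sums a buffer of i64s handed over by the host language.
                          /// #[no_mangle] keeps the symbol name stable, and
                          /// extern "C" fixes the calling convention.
                          #[no_mangle]
                          pub extern "C" fn sum_i64(ptr: *const i64, len: usize) -> i64 {
                              // SAFETY: caller guarantees `ptr` points at `len` i64s.
                              let xs = unsafe { std::slice::from_raw_parts(ptr, len) };
                              xs.iter().sum()
                          }

                      There is no equivalent move for a JVM library, which is a big part of why the Hadoop-era tools ended up as networked servers rather than embeddable libraries.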

                      While Vortex doesn't (yet) have a C API, we do have Python bindings that we expect to be the main way people use it.

                      • beAbU an hour ago

                        For a while "written in Rust" was sort of a "trust me, bro" label. The hivemind asserted that something written in rust must be automatically good and safe, because rust is good and safe.

                        Thank god everyone wised up. The tool maketh not the craftsman. These days the "written in Rust" tag is met with knee-jerk skepticism, as if the hivemind overcorrected.

                    • jagged-chisel 4 hours ago

                      “Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.”

                      So it’s a toolkit written in Rust. It is not a file format.

                      • _willmanning 4 hours ago

                        Perhaps that verbiage is just confusing. "On-disk" sort of implies "file format" but could be more explicit.

                        That said, the immediate next line in the README perhaps clarifies a bit?

                        "Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation."

                        • jagged-chisel an hour ago

                          “Vortex is […] a highly extensible & extremely fast framework for building a modern columnar file format.”

                          It’s a framework for building file formats. This does not indicate that Vortex is, itself, a file format.

                          • aduffy 24 minutes ago

                            Will and I actually work on Vortex :wave:

                            Perhaps we should clean up the wording in the intro, but yes there is in fact a file format!

                            We actually built the toolkit first, before building the file format. The interesting thing here is that we have a consistent in-memory and on-disk representation of compressed, typed arrays.

                            This is nice for a couple of reasons:

                            (a) It makes it really easy to test out new compression algorithms and compute functions. We just implement a new codec and it's automatically available for the file format.

                            (b) We spend a lot of energy on efficient push down. Many compute functions such as slicing and cloning are zero-cost, and all compute operations can execute directly over compressed data.
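
                            To give a flavor of (b) without implying this is the real API (every type below is a made-up stand-in), computing directly over run-length-encoded data looks roughly like this:

                                // Toy example of compute over compressed data:
                                // summing an RLE array without decompressing it.
                                // Illustrative types, not actual Vortex arrays.

                                /// values[i] repeats lengths[i] times.
                                struct RleArray {
                                    values: Vec<i64>,
                                    lengths: Vec<u64>,
                                }

                                impl RleArray {
                                    /// Each run contributes value * run_length, so
                                    /// the decompressed values never materialize.
                                    fn sum(&self) -> i64 {
                                        self.values
                                            .iter()
                                            .zip(&self.lengths)
                                            .map(|(v, l)| v * (*l as i64))
                                            .sum()
                                    }
                                }

                                fn main() {
                                    // Logically [1, 1, 1, 7, 7], stored as two runs.
                                    let arr = RleArray {
                                        values: vec![1, 7],
                                        lengths: vec![3, 2],
                                    };
                                    assert_eq!(arr.sum(), 17);
                                }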

                            Highly encourage you to check out the vortex-serde crate in the repo for file format things, and the vortex-datafusion crate for some examples of integrating the format into a query engine!

                      • gkapur 5 hours ago

                        Not an expert in the space at all, but it does seem like people are exploring new file and table formats, so that is really cool!

                        How does this compare to Lance (https://lancedb.github.io/lance/)?

                        What do you think the key applied use case for Vortex is?

                        • gazpacho 4 hours ago

                          Very cool! Any plans to offer more direct integrations with DataFusion, e.g. a `VortexReaderFactory`, hooks for pushdowns, etc.?

                        • Havoc 5 hours ago

                          Can one edit it in place?

                          That's the main thing currently irritating me about Parquet.

                          • aduffy 4 hours ago

                            You're unlikely to find this with any analytic file format (including Vortex). The main reason is that OLAP systems generally assume an immutable distributed object/block layer (S3, HDFS, ABFS, etc.).

                            It's then generally up to a higher-level component called a table format to handle the idea of edits. See for example how Apache Iceberg handles deletes https://iceberg.apache.org/spec/#row-level-deletes

                            • slotrans 3 hours ago

                              This is true, and in principle a good thing, but in the time since Parquet and ORC were created, GDPR and CCPA have come to exist. Any format we build in that space today needs to support in-place record-level deletion.

                              • aduffy 3 hours ago

                                Yeah, so the thing you do for this is called "compaction", where you effectively merge the original + edits/deletes into a new immutable file. You then change your table metadata pointer to point at the new compacted file and delete the old files from S3.

                                Due to the way S3 and its ilk are structured as globally replicated KV stores, you're not likely to get in-place edits anytime soon, and until the cost structure incentivizes otherwise you're going to continue to see data systems that favor immutable cloud storage.
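
                                As a minimal sketch of that copy-on-write flow (made-up types, not any real table-format API):

                                    use std::collections::HashSet;

                                    /// Rewrites the surviving rows into what becomes
                                    /// a brand-new immutable file; afterwards the
                                    /// table metadata pointer is swapped and the old
                                    /// file is deleted.
                                    fn compact(
                                        rows: &[(u64, String)],
                                        deleted: &HashSet<u64>,
                                    ) -> Vec<(u64, String)> {
                                        rows.iter()
                                            .filter(|(id, _)| !deleted.contains(id))
                                            .cloned()
                                            .collect()
                                    }

                                    fn main() {
                                        let rows = vec![
                                            (1, "a".to_string()),
                                            (2, "b".to_string()),
                                            (3, "c".to_string()),
                                        ];
                                        let deleted: HashSet<u64> =
                                            [2].into_iter().collect();

                                        // Row 2 is gone from the "new" file; nothing
                                        // was ever modified in place.
                                        let compacted = compact(&rows, &deleted);
                                        assert_eq!(compacted.len(), 2);
                                    }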