Of interest and relevance: This past semester, Andy Pavlo's DB seminar at CMU explored a number of projects under the heading 'Database Building Blocks', starting with DataFusion and several of its applications. Take a listen!
https://www.youtube.com/playlist?list=PLSE8ODhjZXjZc2AdXq_Lc...
I feel like I'm not the target audience for this. When I have large data, I write SQL directly and run it against the database. You can't beat the DB on performance when you have to go out to it for the data anyway, so you might as well have it run the query too. And the server ops and DB admins certainly have loads more money to spend on making the DB fast than I do on my antivirus-laden corporate laptop.
When I have small data that fits on my laptop, Pandas is good enough.
Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing one is rare. And even of that 10%, nine points can be solved by dropping down to NumPy and picking a better algorithm...
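To give a concrete (made-up) example of the kind of fix I mean, replacing a row-wise apply with a vectorized NumPy expression:

    import numpy as np
    import pandas as pd

    # Toy data; stand-in for the "annoyingly slow" case
    df = pd.DataFrame({
        "price": np.random.rand(5_000_000),
        "qty": np.random.randint(1, 10, 5_000_000),
    })

    # Slow: a Python-level loop over rows
    # df["total"] = df.apply(lambda r: r.price * r.qty if r.qty > 5 else 0.0, axis=1)

    # Fast: one vectorized pass over contiguous arrays
    price, qty = df["price"].to_numpy(), df["qty"].to_numpy()
    df["total"] = np.where(qty > 5, price * qty, 0.0)

Same answer, usually orders of magnitude faster, and no new dependency.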
I agree. The main reason I shared it is because I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.
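If you're curious, the query side is roughly this with the datafusion Python package (the path and table name here are made up for illustration; the real pipeline has the S3 and cache layers in front):

    import datafusion

    # One SessionContext per process; it holds the registered tables
    ctx = datafusion.SessionContext()

    # Register a Parquet file (in my case, written from ingested JSON) as a table
    ctx.register_parquet("events", "/tmp/events.parquet")  # hypothetical path

    # Run plain SQL against it and pull results back
    ctx.sql("""
        SELECT service, count(*) AS n
        FROM events
        GROUP BY service
        ORDER BY n DESC
    """).show()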
You're right, it isn't for you.
It's largely for companies who can't put everything in a single database because (a) they don't control the source schema, e.g. it's a daily export from a SaaS app, (b) the ROI of doing so isn't high enough, and (c) the data isn't in a relational format, e.g. JSON, logs, telemetry, etc.
And with the trend toward SaaS apps, that situation is becoming more common.
There is a Cambrian explosion in data processing engines: DataFusion, Polars, DuckDB, Feldera, Pathway, and more than I can remember.
It reminds me of 15 years ago, when there was just JDBC/ODBC for data. Then, as data volumes increased, specialized databases became viable: graph, document, JSON, key-value, etc.
I don't see SQL and Spark hammers keeping their ETL monopolies for much longer.
Why would this be useful over DuckDB? (earnest question)
They're similar, but DuckDB is more of a batteries-included database, whereas DataFusion is an embeddable query engine. You can use DuckDB in embedded-ish scenarios, but that's not its primary target. To put it another way, DataFusion is sometimes described as "the LLVM of databases."
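For a sense of what "query engine as a library" means in practice, here's a hedged sketch using the Python bindings (the UDF and data are invented, and the exact udf() signature may vary by version) of teaching the engine a new SQL function:

    import pyarrow as pa
    import datafusion
    from datafusion import udf

    ctx = datafusion.SessionContext()

    # A scalar UDF: operates on Arrow arrays and gets planned like a built-in
    def redact(arr: pa.Array) -> pa.Array:
        return pa.array(
            ["<redacted>" if v is not None else None for v in arr.to_pylist()]
        )

    redact_udf = udf(redact, [pa.string()], pa.string(), "stable", name="redact")
    ctx.register_udf(redact_udf)

    # Illustrative query; VALUES is just a stand-in for a real table
    ctx.sql("SELECT redact(tok) FROM (VALUES ('abc'), ('def')) AS t(tok)").show()

That kind of extension point (plus custom table providers, optimizer rules, etc.) is the "LLVM of databases" angle: you build your system around the engine rather than around a server.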
Another difference is that DuckDB is written in C++ whereas DataFusion is in Rust, so all the usual memory-safety and performance arguments apply. In fact, DataFusion recently overtook DuckDB in ClickBench results after a community push last year to optimize its performance.
How does this compare/contrast with Polars? Seems pretty similar; has anybody tried both?
DataFusion and Polars are two sides of the same Rust coin. DataFusion is built to serve as the backbone of SQL-based analytics systems, including distributed engines that execute complex queries across clusters. Polars, on the other hand, is laser-focused on fast single-node data manipulation, offering a Python-friendly DataFrame API that feels intuitive for exploratory analysis and in-memory processing.
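To make that concrete, here's roughly the same query in both (a sketch; the file and column names are invented):

    import polars as pl
    import datafusion

    # Polars: lazy DataFrame API; the optimizer kicks in at .collect()
    out_pl = (
        pl.scan_parquet("/tmp/events.parquet")   # hypothetical file
          .filter(pl.col("status") == "error")
          .group_by("service")
          .agg(pl.len().alias("n"))
          .collect()
    )

    # DataFusion: register the file, then express the same thing in SQL
    ctx = datafusion.SessionContext()
    ctx.register_parquet("events", "/tmp/events.parquet")
    out_df = ctx.sql(
        "SELECT service, count(*) AS n FROM events "
        "WHERE status = 'error' GROUP BY service"
    ).to_pandas()

Which one feels natural mostly comes down to whether you'd rather compose DataFrame expressions or write SQL.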
And the thing is, a single node can still scale ridiculously high without the orchestration overhead of distributed systems.
You can get dual 192-core AMD CPUs (384 cores / 768 threads) with 9 TB of memory and a 24-disk SSD array in a 2U box.
Exactly. Choosing DataFusion implies the batteries-included Apache big-data ecosystem, while Polars is chasing the Python/Pandas crowd and uses Python syntax, which is handy if you're already comfortable with IPython.
Can't you use DataFusion single-node, without any of the Apache ecosystem stuff? They have a Python library, and DataFusion is "just" a query engine. (If anything, I'd call Pandas the batteries-included option...)
I think the difference is more that DataFusion is built as a library you can plug into the product you're building (e.g. Comet, which plugs it into Spark, or pg_lakehouse, which plugs it into Postgres). Polars could be used that way too, but it's also a fully functional package you can pip install and use as a Pandas alternative right now.