Comments Page - Improving Parquet Dedupe on Hugging Face Hub

« Back Improving Parquet Dedupe on Hugging Face Hubhuggingface.coSubmitted by ylow 9 months ago

ignoreusernames 9 months ago
> Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates
I'm not really familiar of how datasets are managed by them, but all of the table formats (iceberg, delta and hudi) support appending and some form of "merge-on-read" deletes that could help with this use case. Instead of always fully replacing datasets on each dump, more granular operations could be done. The issue is that this requires changing pipelines and some extra knowledge about the datasets itself. A fun idea might involve taking a table format like iceberg, and instead of using parquet to store the data, just store the column data with the metadata externally defined somewhere else. On each new snapshot, a set of transformations (sorting, spiting blocks, etc) could be applied that minimizes that the potential byte diff between the previous snapshot.
YetAnotherNick 9 months ago
I just don't understand how these guys could literally give terrabytes of free storage and free data transfer to everyone. I was doing some calculation of cost from my storage and transfers and if they used something like S3 it would costed them 1000s of dollar. And I don't pay them anything.
- mritchie712 9 months ago
  > As Hugging Face hosts nearly 11PB of datasets with Parquet files alone accounting for over 2.2PB of that storage
  11PB on S3 would cost ~$250k per month / $3m per year.
  HuggingFace has raised almost $400M.
  Not saying it's nothing, but probably not a big deal to them (e.g. ~10 of their 400+ staff cost more).
- fpgaminer 9 months ago
  HuggingFace really is such an amazing resource to the ML community. Not just for storing datasets, but being able to stand up a demo of my models using spaces for anyone to use? It's hard to overstate how useful that is.
- ylow 9 months ago
  We are here to help lower that :-) . As we can push dedupe to the edge we can save on bandwidth as well. And hopefully make everyone upload and download faster.
kwillets 9 months ago
One additional thought regarding query performance is that content-defined row groups allow localized joins and aggregations which are much faster than the globally-shuffled kind.
If the sharding key matches (or is a subset of) a join or group-by key, then identical values are local to a single shard, which can be processed independently.
This type of thing is typically done at large granularity (eg one shard per MPP compute node), but there are also benefits down to the core or thread level.
Another tip is that if no shard key is defined, hash the whole row as a default.
- kwillets 8 months ago
  Strike the sharding idea as I misunderstood the CDC method.
kwillets 9 months ago
I'm surprised that Parquet didn't maintain the Arrow practice of using mmap-able relative offsets for everything. Although these could be called relative to the beginning of the file.
- ylow 9 months ago
  I believe Parquet predates Arrow. That's probably why.
- sagarm 9 months ago
  They're optimized for different things.
  Arrow is designed for zero copy ipc -- it is, by definition, an in-memory format that is therefore mmappable.
  Parquet is an on-disk format, designed to be space efficient.
  So for example, Parquet supports general purpose compression in addition to dictionary and RLE encodings. General purpose compression forces you to make copies, but if you're streaming from disk the extra cost of decompressing blocks is acceptable.
  Arrow doesn't use general purpose compression because it would force copies to be made and dominate compute costs for data in memory.
jmakov 9 months ago
Wouldn't be it easier to extend delta-rs to support deduplication?
- ylow 9 months ago
  Can you elaborate? As I understand Delta Lake provides transactions on top of existing data and effectively stores "diffs" because it knows what the transaction did. But when you have regular snapshots, its much harder to figure out the effective diff and that is where deduplication comes in. (Quite like how git actually stores snapshots of every file version, but very aggressively compressed).
skadamat 9 months ago
Love this post and the visuals! Great work
kwillets 9 months ago
How does this compare to rsync/rdiff?
- ylow 9 months ago
  Great question! Rsync also uses a rolling hash/content defined chunking approach to deduplicate and reduce communication. So it will behave very similarly.
  kwillets 9 months ago
  One more: do you prefer the CDC technique over using the rowgroups as chunks (ie using knowledge of the file structure)? Is it worth it to build a parquet-specific diff?
  ylow 9 months ago
  I think both are necessary. The cdc technique is file format independent. The row group method makes Parquet robust to it.