Smaller things are faster to copy, etc. The fun part is that the opposite is true as well: when you have some constant load on a service, making the requests faster means fewer requests are in flight at once (Little's law), so the aggregate memory consumed by in-flight requests drops too.
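To put hypothetical numbers on it: Little's law says L = λ × W (requests in flight = arrival rate × time per request). At a steady 1,000 requests/s and 100 ms per request, roughly 100 requests are in flight at any instant; cut the latency to 20 ms and only about 20 are, so whatever memory each in-flight request holds is now being paid for by a 5x smaller count.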
That's not even the point they're really making here, IMO.
The significant decrease they talk about is a side effect of their chosen language having a GC. This means the strings take more work to deal with than expected.
To me this speaks more to the fact that the small costs associated with certain operations do eventually add up. It's not entirely clear in the post where and when the cost from the GC is incurred, though; I'd presume on creation and destruction?
The cost of a string array is paid on every GC phase. That array contains (or may contain) references, so the GC has to visit each element on every collection to see if anything changed. An int array cannot contain references, so it can be skipped.
edit: There are tricks to avoid traversing a compound object every time, but you can assume at least one of the 80M objects in that giant array gets modified between GC activations.
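A minimal Go sketch of the effect (Go just because it's a convenient GC language to demonstrate with; the post's language will differ in the details, and the element count is scaled down from the ~80M in the article):

    package main

    import (
        "fmt"
        "runtime"
        "strconv"
        "time"
    )

    // Time a forced collection; runtime.GC blocks until the cycle finishes.
    func timeGC() time.Duration {
        start := time.Now()
        runtime.GC()
        return time.Since(start)
    }

    func main() {
        const n = 10_000_000 // scaled down from the ~80M elements in the post

        // An int64 slice contains no pointers, so its backing array is
        // allocated as "no scan" memory and the GC never looks inside it.
        ints := make([]int64, n)
        fmt.Println("GC with live []int64: ", timeGC())

        // A string slice is a slice of (pointer, length) headers, each
        // pointing at its own small heap object, so every mark phase has
        // to walk all n entries for as long as the slice stays live.
        strs := make([]string, n)
        for i := range strs {
            strs[i] = strconv.Itoa(i)
        }
        fmt.Println("GC with live []string:", timeGC())

        runtime.KeepAlive(ints)
        runtime.KeepAlive(strs)
    }

I'd expect the second timing to be noticeably worse, since the string slice keeps millions of pointers live that every mark phase has to walk.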
Even without a GC, actual strings are potentially expensive because each of them is a heap allocation. With a small string optimisation you avoid this for short strings (e.g. popular C++ standard library string types can hold up to 22 bytes inline, and Rust's CompactString holds 24), but I wouldn't expect a GC language to have SSO.
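For a concrete data point, in Go (a GC language with no SSO) a string value is just a 16-byte pointer-plus-length header, and the actual bytes are a separate heap allocation no matter how short the string is:

    package main

    import (
        "fmt"
        "strings"
        "unsafe"
    )

    func main() {
        // A Go string value is only a header: a pointer to the bytes plus a length.
        s := strings.Repeat("x", 5) // built at runtime, so the 5 bytes get their own heap allocation
        fmt.Println(unsafe.Sizeof(s)) // 16 on 64-bit: the header, not the bytes

        // A slice of short strings is therefore one 16-byte header per element,
        // each pointing at bytes stored elsewhere on the heap -- where a C++ or
        // Rust SSO string would keep bytes this short inline with no allocation.
        words := make([]string, 3)
        for i := range words {
            words[i] = strings.Repeat("y", i+2)
        }
        fmt.Println(unsafe.Sizeof(words[0])) // still 16 per element
    }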
The given task can be accomplished with not more than a few kilobytes of RAM, a constant independent of the input and output sizes, but unfortunately I suspect the vast majority of programmers now have absolutely no idea how to do so.
How about you enlighten us rather than just taunt us with your superior knowledge?
I can see how it'd be possible to transform from the input tabular format to the JSON format, streaming record by record, using a small constant amount of memory, provided the size of individual records is bounded independently of the record count. You'd need to maintain a position offset into the input across records, but that's about it (rough sketch below).
But maybe we'd need to know more about how the output data is consumed to tell whether this would actually help much in the real application. If the next stage of processing wants to randomly access records using Get(int i), where i is the index of the item, then even if we transform the input to JSON with a constant amount of RAM, we still have to store the output JSON somewhere so we can Get those items.
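Here's a rough Go sketch of that streaming idea (the input format and field handling are guesses, since the post doesn't show them): read one tabular record at a time from stdin and emit one JSON object per line to stdout, so memory is bounded by the largest single record rather than the whole file.

    package main

    import (
        "encoding/csv"
        "encoding/json"
        "io"
        "log"
        "os"
    )

    func main() {
        r := csv.NewReader(os.Stdin)

        // First row is assumed to be the column names.
        header, err := r.Read()
        if err != nil {
            log.Fatal(err)
        }

        enc := json.NewEncoder(os.Stdout)
        for {
            rec, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                log.Fatal(err)
            }
            // Only one record is held in memory at a time.
            obj := make(map[string]string, len(header))
            for i, name := range header {
                obj[name] = rec[i]
            }
            if err := enc.Encode(obj); err != nil { // one JSON object per input record
                log.Fatal(err)
            }
        }
    }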
The blog post mentioned "padding"; I didn't immediately understand what that was referring to (padding in the output format?), but I guess it must mean struct padding: the items were previously stored as an array of structs, while the code in the article transposed everything into homogeneous arrays, eliminating the padding overhead.
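Something along those lines would look like this in Go (field names invented for illustration; the article's actual structs aren't shown here):

    package main

    import (
        "fmt"
        "unsafe"
    )

    // Array-of-structs layout: every element carries alignment padding.
    type Record struct {
        Flag  bool    // 1 byte, then 7 bytes of padding so Value is 8-byte aligned
        Value float64 // 8 bytes
        Kind  uint8   // 1 byte, then 7 bytes of trailing padding
    }

    // Transposed struct-of-arrays layout: one homogeneous slice per column, no padding.
    type Columns struct {
        Flags  []bool
        Values []float64
        Kinds  []uint8
    }

    func main() {
        fmt.Println(unsafe.Sizeof(Record{})) // 24 bytes per record, 14 of them padding
        // Across the three column slices the same record costs 1 + 8 + 1 = 10 bytes,
        // at the price of giving up per-record locality.
    }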
Okay Fermat
Only real programmers know how to do that.