• mynegation 3 hours ago

    “I admit I don’t understand these results. There’s clearly nothing in the runtime itself that prevents these types of speeds.” Oh, there is. The default Java serialization is sort of like the “pickle” module in Python, if you are familiar. It will deal with pretty much anything you throw at it, figuring out the data structures and offsets to serialize or parse at runtime. More efficient methods trade universality for speed: the offsets and the calls to read/write the parts of the structure are determined in advance. Also, it's hard to say without source code, but there is a high chance that even more efficient methods like Protobuf create a lot of Java objects, and that kills cache locality. With Java, you have to go out of your way to maintain good cache locality because you give up control over memory layout for automatic memory management.
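
    To illustrate the difference, a rough sketch (not the article's code; the Measurement record is made up):

      import java.io.*;

      record Measurement(String city, short temperature) implements Serializable {}

      class SerializationStyles {
          public static void main(String[] args) throws IOException {
              var m = new Measurement("Stockholm", (short) 132);

              // Universal: ObjectOutputStream discovers the structure via
              // reflection at runtime and writes class metadata to the stream.
              try (var out = new ObjectOutputStream(new FileOutputStream("obj.bin"))) {
                  out.writeObject(m);
              }

              // Specialized: field order and types are fixed in advance, so
              // writing is two direct calls - no reflection, no metadata.
              try (var out = new DataOutputStream(
                      new BufferedOutputStream(new FileOutputStream("raw.bin")))) {
                  out.writeUTF(m.city());
                  out.writeShort(m.temperature());
              }
          }
      }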

    • jffhn 20 minutes ago

      >With Java, you have to go out of your way to maintain good cache locality because you give up control over memory layout for automatic memory management.

      There is a shadowy cult in a hidden corner of the Java community, a heresy to many, followed only by a handful of obnoxious zealots inspired by the dark ages of Ada 83, C, or even assembly, who take pride in creating Java programs that allocate only a finite number of objects no matter how long you run them, and to whom the "new" keyword is taboo, its avoidable use tantamount to blasphemy.

      As a member of this sect, on a few occasions when presenting some of our programs on a laptop, I've had dumbfounded observers looking around the laptop for the network cable linking it to the server they assumed it must have been running on.
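
      For the uninitiated, the style looks roughly like this (a toy sketch, not from any real codebase; readInto and process are hypothetical):

        // Allocated once at startup, then mutated in place forever after.
        final class MutableMeasurement {
            final byte[] city = new byte[64]; // fixed-capacity scratch buffer
            int cityLength;
            short temperature;
        }

        // The hot loop reuses the same instance instead of saying "new".
        MutableMeasurement scratch = new MutableMeasurement();
        while (readInto(scratch, input)) { // hypothetical: parses into the holder
            process(scratch);              // hypothetical: consumes it in place
        }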

      • hedora 2 hours ago

        The article does have source code.

        I don’t think any of the examples use Java’s Serializable. The first attempt reads shorts and utf8 directly from the stream.

        • marginalia_nu 2 hours ago

          ObjectInputStream is one of the faster stream options tested.

        • jbellis 2 hours ago

          True, but none of the slow methods in the article involve this.

          • kaba0 2 hours ago

            > Also, it's hard to say without source code, but there is a high chance that even more efficient methods like Protobuf create a lot of Java objects, and that kills cache locality

            I don’t think this can be claimed that easily without more info. Generational GCs work pretty much like an arena allocator, with very good cache locality (think of an ArrayList being filled with objects that are allocated in short order: the objects will sit right next to each other in memory). If the objects are short-lived, allocation can be nearly as cheap as stack allocation (thread-local allocation buffers just bump a pointer).
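
            To illustrate the arena-like behavior (a sketch; actual placement depends on the GC and JVM version, but this is the typical case):

              import java.util.*;

              record Point(int x, int y) {}

              class TlabDemo {
                  public static void main(String[] args) {
                      // Objects allocated back-to-back on one thread come out of
                      // the same thread-local allocation buffer (TLAB): a pointer
                      // bump each time, so they land next to each other on the
                      // heap, much like an arena allocator would place them.
                      List<Point> points = new ArrayList<>();
                      for (int i = 0; i < 1_000_000; i++) {
                          points.add(new Point(i, i));
                      }
                      // Iterating then walks a mostly contiguous region of memory.
                      System.out.println(points.size());
                  }
              }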

            • marginalia_nu an hour ago

              GC pressure is another factor. Even in the trivial-ownership case, GCing in an environment that allocates gigabytes per second comes at a cost.

          • charleslmunger 32 minutes ago

            I noticed:

            1. Small read buffers. No reason to sequentially read and parse gigabytes only 4kb at a time.

            2. parseDelimitedFrom created a new CodedInputStream on every message, which has its own internal buffer; that's why you don't see a buffered stream wrapper in the examples. Every iteration of the loop is allocating fresh 4kb byte[]s.

            3. The nio protobuf code creates wrappers for the allocated ByteBuffer on every iteration of the loop.

            But the real sin with the protobuf code is serializing the same city names over and over, then reading, parsing, and hashing them again and again. Writing a header that maps each city string to an integer would dramatically shrink the file and speed up parsing; if that were done, the cost would essentially be the cost of decoding varints.
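
            Roughly this shape (a sketch with plain streams rather than protobuf, and invented names, just to show the idea):

              import java.io.*;
              import java.util.*;

              class DictionaryEncode {
                  // Header: each distinct city name mapped to a small integer id.
                  // Body: (id, temperature) pairs instead of repeated strings.
                  static void write(List<String> cities, List<Short> temps,
                                    DataOutputStream out) throws IOException {
                      Map<String, Integer> ids = new LinkedHashMap<>();
                      for (String c : cities) ids.putIfAbsent(c, ids.size());

                      out.writeInt(ids.size());                      // dictionary size
                      for (String c : ids.keySet()) out.writeUTF(c); // id -> name, in order

                      for (int i = 0; i < cities.size(); i++) {
                          out.writeInt(ids.get(cities.get(i)));      // could be a varint
                          out.writeShort(temps.get(i));
                      }
                  }
              }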

            • mike_hearn 2 hours ago

              It'd be interesting to see Cap'n Proto profiled here, as the whole idea of that format is to eliminate deserialization overhead entirely.
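
              With capnproto-java the read path would look roughly like this (from memory, so treat the API as approximate; Measurement stands in for a schema-generated class):

                import org.capnproto.MessageReader;
                import org.capnproto.Serialize;
                import java.nio.channels.FileChannel;
                import java.nio.file.Paths;

                try (FileChannel channel = FileChannel.open(Paths.get("data.capnp"))) {
                    MessageReader message = Serialize.read(channel);
                    // No parse step: the reader is a typed view over the bytes
                    // in the same layout they had on disk.
                    Measurement.Reader m = message.getRoot(Measurement.factory);
                    System.out.println(m.getTemperature());
                }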

              • garblegarble an hour ago

                I know it's bad form to comment on style instead of content, but saying “smartphone enjoyers will want to switch to horizontal mode for this article” just feels disrespectful, when the code samples barely fit on desktop and the article's text column shrinks to less than a third of the horizontal space.

                • marginalia_nu an hour ago

                  Try a refresh? It should cover 60ch, but some browsers bug out when you rotate the screen, for reasons I don't understand.

                  • vips7L an hour ago

                    Size 89 font for the code samples too

                    • SebastianKra an hour ago

                      Yep. I have the same frustration with GitHub and GitLab, which add 3km of horizontal padding on phone screens for no reason.

                    • splix an hour ago

                      I don't think Java serialization is designed for such a small object with just two fields. It's designed for large and complex objects. Obviously it will be slower and much larger in size than a columnar implementation designed and heavily optimized for this scenario. It's not a fair comparison, and it's too far from a real use case.

                      Try it with nested objects and at least a dozen fields across the hierarchy, with a different structure for each row (see the sketch at the end of this comment). It's still not a use case for Java serialization, but at least it's closer to what real code would do.

                      Same goes for Protobuf, I guess. And the JSON serialization plays more or less the same role.

                      Maybe something like Avro Data Files would be a better comparison against columnar formats.
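
                      Something shaped more like this, say (hypothetical, just to show the scale):

                        import java.io.Serializable;
                        import java.math.BigDecimal;
                        import java.time.Instant;
                        import java.util.List;

                        // A fairer test subject: nested structure, a dozen-plus
                        // fields spread across the hierarchy.
                        record Address(String street, String city, String zip,
                                       String country) implements Serializable {}

                        record Order(long id, Instant placedAt, BigDecimal total,
                                     List<String> lineItems) implements Serializable {}

                        record Customer(long id, String name, String email,
                                        Address billing, Address shipping,
                                        List<Order> orders) implements Serializable {}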

                      • SirYwell 32 minutes ago

                        Why doesn't it mention the Java version used? A few flame graphs would be interesting as well.

                        • twoodfin 3 hours ago

                          I’m probably missing something obvious, but what’s wrong with Apache parquet-java for this use case?

                          • marginalia_nu 2 hours ago

                            The implementation is inextricably merged with Hadoop, to the point where it is not useful outside of it.

                            Parquet-floor is a shim that replaces the Hadoop dependencies with drop-in java.io ones.

                          • pestatije 3 hours ago

                            Are the benchmarks done properly? What's the actual test code?

                            • Jaxan 3 hours ago

                              I don’t think the conclusion needs a lot of precision in the benchmark. When the suggested code from the standard library (or some tutorial) is two orders of magnitude slower, something is not right.

                              The author is right that we are wasting something somewhere when we are only operating at 2% of the possible speed of the hard disk.

                              • antonhag 2 hours ago

                                From the code samples it's hard to tell whether or not this has to do with de-serialization, though. It would have been fun to see profiling results for tests like these.

                                • marginalia_nu 2 hours ago

                                  Author here, I'm away from my computer atm, but I can cook up a repo with each test in a few hours when I get home.

                                  I designed the tests as a drag race because that mimics my real world usage.

                                  • antonhag 36 minutes ago

                                    That's nice - I'd encourage you to play around with attaching e.g. JMC [1] to the process to better understand why things are as they are.
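
                                    You don't even need to attach: a recording can be captured from inside the benchmark with the jdk.jfr API (JDK 11+; runBenchmark is a stand-in for the code under test):

                                      import jdk.jfr.Recording;
                                      import java.nio.file.Path;

                                      try (Recording recording = new Recording()) {
                                          recording.start();
                                          runBenchmark(); // hypothetical: the code under test
                                          recording.stop();
                                          recording.dump(Path.of("profile.jfr")); // open in JMC
                                      }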

                                    I tried recreating your DataInputStream + BufferedInputStream test (wrote the 1brc data to separate output files and read it using your code; I had to guess at the ResultObserver implementation, though). On my machine it ran in roughly the same time frame as yours, ~1 min.

                                    According to Flight Recorder:

                                      - ~49% of the time is spent reading the strings (city names), almost all of it in the DataInputStream.readUTF/readFully methods.
                                      - ~5% of the time is spent reading temperatures (readShort).
                                      - ~41% of the time is spent doing hashmap look-ups for computeIfAbsent().
                                      - About 50 GB of memory is allocated, 99.9% of it for Strings (and the byte[] arrays wrapped in them). This likely causes quite a bit of GC pressure.
                                    
                                    Hash-map lookups are not de-serialization, yet they likely affected the benchmarks quite a bit. The rest of the time is mostly spent reading and allocating strings. I would guess that is true for some of the other implementations in the original post as well.

                                    [1] https://github.com/openjdk/jmc

                                    edit: better link to JMC

                                    • cowwoc 2 hours ago

                                      Hi,

                                      Please add https://github.com/apache/fury to the benchmark. It claims to be a drop-in replacement for the built-in serialization mechanism, so it should be easy to try.
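
                                      Usage is along these lines, if memory serves (package names have moved around since it joined the Apache incubator, so treat this as approximate; Measurement is a stand-in for the benchmark's record):

                                        import org.apache.fury.Fury;
                                        import org.apache.fury.config.Language;

                                        Fury fury = Fury.builder()
                                                .withLanguage(Language.JAVA)
                                                .requireClassRegistration(true)
                                                .build();
                                        fury.register(Measurement.class); // hypothetical record

                                        byte[] bytes = fury.serialize(measurement);
                                        Measurement back = (Measurement) fury.deserialize(bytes);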

                                      • marginalia_nu 2 hours ago

                                        Will do!