Interning in Go (medium.com)
Submitted by todsacerdoti 3 days ago
  • pjmlp 9 hours ago

    Interestingly enough, by following up some of the article's references, I discovered that Go is also following up on Java and .NET design decisions that could perhaps have been there early on:

    - Deprecating finalizers and using cleaner queues (https://openjdk.org/jeps/421)

    - Weak references (https://learn.microsoft.com/en-us/dotnet/api/system.weakrefe..., https://docs.oracle.com/javase/8/docs/api/java/lang/ref/Weak...)

    Related tickets,

    "runtime: add AddCleanup and deprecate SetFinalizer" - https://github.com/golang/go/issues/67535

    "weak: new package providing weak pointers" - https://github.com/golang/go/issues/67552

    One day Go will eventually have all those features that they deemed unnecessary from "complex" languages.

    • nasretdinov 5 hours ago

      Note that both of those features use generics, which weren't present in the language before, so IMO Go just prefers to wait a bit before implementing useful features instead of having too many of them (ahem, Swift?)

      • pjmlp 5 hours ago

        They use generics because they are available now, otherwise most likely they would be magic functions like before.

      • badhombres 5 hours ago

        I think it’s a bit of a stretch to say Go will implement all the features of C# and Java because of a few new features. Go isn’t a frozen language; they just take a lot of time and discussion before committing to a major change.

        • pjmlp 2 hours ago

          The point isn't implementing all the features of C# and Java, but rather doubling down on their simplicity mantra against all programming-language complexity, only to revisit such decisions years later, because after all the other programming languages had good reasons to have such features in the first place.

        • pyrale 3 hours ago

          Go is an exercise in re-discovering the state of the art in programming languages. I guess they proved that experience is the teacher of fools.

        • jfoutz 15 hours ago

          Interning is neat. Most of my experience is really dated: primarily in the JVM, and mostly for class names, for reflection and class loaders. It's sort of surprising seeing this added to Go, with its desire for minimalism. But when you can use it, it can be a big win.

          Look past the "loading the whole book in memory" part; the author gets to the point soon enough.

          The IP address example is OK. It's true, and it highlights some important points. But keep in mind pointers are 64-bit. If you're not on IPv6 and you're shuffling a lot of them, you're probably better off just keeping the uint64 and converting to a string and allocating the struct as needed. Interning doesn't appear to be much of a win in that narrow case. But if you do care about IPv6, and you're connecting to millions of upstreams, it's not unreasonable.

          It's neat that it's available. It's good to be aware of interning, but it's generally not a huge win. For a few special cases, it can be really awesome.

          ** edit: uint32 for IPv4. Bit counting is hard.

        • saylisteins 30 minutes ago

          Will this work across different goroutines?

          • val_deleplace a few seconds ago

            Yes, the package is designed to be thread-safe.

          • nickcw 8 hours ago

            The unique package is my top feature for go1.23. I've been experimenting with it in rclone.

            People often want to have millions of S3 objects in memory and reducing the memory used would be very desirable.

            I interned all the strings used (there are a lot of duplicates, like Content-Type) and it reduced the memory usage by about 30%, which is great.

            I wonder how much difference this little fix mentioned in the article for go1.23.2 will make? https://github.com/golang/go/issues/69370

            The article also mentions strings.Clone which has been around for a while. Using that is very easy and it stops big strings being pinned into memory. I had a problem with this in the S3 backend where some string was pinning the entire XML response from the server which strings.Clone fixed.

          • survivedurcode 15 hours ago

            Beware the trade-offs of interning affecting GC behavior. Now you can’t have a stack-allocation optimization, for example.

            • tapirl 8 hours ago

              The interning feature should only be used in some rare cases; it is not intended to be used widely.

            • morkalork 15 hours ago

              This is new for Go? I remember learning about Java string interning decades ago in the context of xml parsers. If I remember correctly, there were even some memory leaks associated with it and thread locals?

              • pphysch 15 hours ago

                You could've implemented bespoke interning at any point in Go; it was added to the standard library only recently, though, likely because it may leverage Go's relatively recent support for generics.

              • cherryteastain 10 hours ago

                Cool idea, but sounds detrimental in terms of cache efficiency. Typically processing a string by reading it sequentially is quite cache efficient as the processor will prefetch, but with this method it seems like the string will not be contiguous in memory which will lead to more cache misses.

                • pimeys 8 hours ago

                  It's quite common if you need to parse a schema and keep it in memory for a long time. We're working on a GraphQL federation gateway, and interning strings is really useful there since schemas can be hundreds of megabytes.

                  Writing your own interner is under a hundred lines of code and takes about 15 minutes. Writing a thread-safe interner is a trickier problem, but there are good libraries for that.

                  • ctz 6 hours ago

                    Lots of small strings sparsely spread over 282MB is pretty cache inefficient?

                    Whereas the separate small allocations will likely end up close together, which gives a better chance that a given cache line will contain several rather than just one.

                    • SwiftyBug 9 hours ago

                      Seems like the classic trade-off of compute time vs. memory.

                    • peterldowns 17 hours ago

                      I missed the initial blogpost about this; thanks for the solid explanation and the links. Probably won't make much of a difference for my use cases but cool to know this is now in the stdlib.

                      • pjot 14 hours ago

                          > I have a very large plaintext file and I’m loading it fully in memory. This results in a string variable book that occupies 282 MiB of memory.
                        
                        At what point does something become (very) large?

                        • sethammons 5 hours ago

                          With not-really-big numbers like this one, I think of having n workers, each having to do a thing. Every time you save on the memory footprint, you can grow the number of workers. Sometimes that is really, really good. Usually it is probably meh.

                        • favflam 10 hours ago

                          Is this the same as grpc.SharedBufferPool? The gRPC implementation does a lot of memory allocation.

                          • User23 15 hours ago

                            For reference, the term comes from Lisp’s INTERN. [1]

                            [1] http://clhs.lisp.se/Body/f_intern.htm

                            • guappa 6 hours ago

                              Eh? Seems rather trivial to me. In Python one could implement it oneself with a set, a list, and a for loop.

                              It hardly seems worth having a library for such a specific use case.

                              • Someone 5 hours ago

                                You can’t do exactly this if you’re not in control of all code.

                                With this, if a third-party library interns a string whose content is “Foo” they get a Handle. If you also intern a string whose content is “Foo”, you get the same Handle back.

                                Having this in the standard library will likely make all code doing what you propose migrate to unique.

                                Also, the implementation likely (I haven’t checked this implementation) will move any interned objects out of the memory range(s) inspected when doing garbage collection. That makes garbage collection do a bit less work.

                                • dewey 2 hours ago

                                  > It hardly seems worthy having a library over such a specific use case.

                                  Seems much better to have it in the stdlib like they do right now than having to import a third party dependency. Especially as such a low level package will probably be used by _other_ external packages you would be using.

                                • liotier 16 hours ago

                                  Is this essentially dictionary compression ?

                                  • nine_k 15 hours ago

                                    It's similar. It's just not by word, but by entire string, assuming that strings are immutable, long enough, and the same string values are referenced often enough.

                                    On 64-bit machines though strings shorter than 8 bytes are better off not interned.

                                    • jfoutz 15 hours ago

                                      As I read it, it's any struct, not just strings! Which is cool.

                                      • TheDong 15 hours ago

                                        Any "comparable" struct, which is to say any struct where '==' works. No, you can't override '=='.

                                        This will not work for the average struct, and if you do use it, you can no longer include, for example, any `[]byte` fields in the struct, or else it will no longer compile.

                                        I always find it funny that `string` is comparable, but `[]rune` and `[]byte` are not.

                                        • aatd86 11 hours ago

                                          It's because string values are immutable, so at any point in a program the result of a string comparison would be the same.

                                          For slice types it's more complex because there is mutability and implicit aliasing, and the backing array pointed at may change. I guess the latter should point us toward deep value comparison for slice types, since any other comparison would be quite flimsy.

                                    • recursive 15 hours ago

                                      No. This is avoiding keeping multiple copies of the same value. There's no compression.

                                      • VWWHFSfQ 15 hours ago

                                        That's pretty much the definition of compression

                                        • recursive 14 hours ago

                                          Data compression can reduce the space required to store a single payload. That's not what's going on here. I mean, I understand what you're saying, but calling this compression seems counter-productive to understanding what's going on.

                                          • ithkuil 9 hours ago

                                            I think it is useful to consider this a form of lossless compression, not only because it technically is one, but also because a simple compression scheme like this is a good introduction to the various trade-offs that compression algorithms can have.

                                            This is an instance of dictionary encoding where terms are words and dictionary code space is the entire machine addressing space (heap).

                                            The disadvantage of this approach is that you can only "deduplicate" words that are exactly the same and you cannot remove duplicated substrings. Another disadvantage is that the dictionary code size is 8 bytes which means you achieve some saving only for words that are longer than 8 bytes (plus some other overhead for the interning index).

                                            Compression algorithms that use dictionary encoding typically use code/symbol sizes smaller than 64 bits, often using variable length encoded numbers encoded with some kind of entropy coding that is able to use short bit sequences for codes representing dictionary entries appearing often and longer bit sequences for less frequent ones.

                                            But the idea is the same. During decompression you read each "number" from the compressed stream and you use it effectively to index an array that contains the target (sub)strings.

                                            The algorithm in the post just interprets the 64-bit dictionary code as an index in the giant array that is the machine addressing space.

                                            • rat9988 10 hours ago

                                              I'm not sure I can follow you here. Data compression cannot reduce the size of a single word if we count the dictionary size.

                                              • recursive 8 hours ago

                                                It's not impossible... but who said anything about a single word?

                                              • wjholden 10 hours ago

                                                I like the term "deduplication."

                                          • smellybigbelly 11 hours ago

                                            Couldn’t you read in the book a bit smarter by deduplicating the io stream?

                                            • guappa 6 hours ago

                                              Let's say that you didn't read the book, but merely mmaped it.
