« BackHash Ordering and Hyrum's Laweaftan.github.ioSubmitted by ColinWright 6 days ago
  • the8472 2 hours ago

    > Can’t you just break them? They’re violating the spec!

    > No, you can’t break them.

    You can. A vendor playing nice may provide advance warnings, fallbacks, assess blast radius etc, but that's no way a given, some vendors will not act nicely and leave you out cold if you ignore the spec.

    Java 7 switched array.sort() to TimSort. It broke some users, they switched nevertheless and offered an override option for some time so people could migrate. Java 9 introduced the modules, which broke some things that tried to access internals, there were overrides for some time so people could migrate.

    By default Rust does not specify the field order of its structs, people relied on it anyway. Several releases changed the ordering algorithm, no opt-out was offered since there already was an annotation to specify a fixed layout, people just had to use it.

    It's called "API contract" for a reason, it puts obligations on both sides.

    • jerf an hour ago

      I am reminded of how Postel's Law has fallen from Iron Law of the Internet to somewhat distasteful in the past 5-10 years. Yes, it's no fun to break your clients and customers today. But if you don't do it today, and you don't do it next month, next year, or the next four years, you'll find yourself in a position where you can't do anything anymore because you've basically let Hyrum's Law "build up" in your system until there's no room to move.

      Obviously, one should not willy-nilly break customers for no reason. The point is not to create a management metric around Number Of Times We've Broken The Customer and reward the team for getting their numbers up, up, up. The point is to retain flexibility and the ability to move forward, and that is done through intelligent design up front to the extent possible, and carefully selecting when to break customers over the course of years. It's always expensive and should definitely be treated as an expense, not a benefit on its own terms. But as the Hyrum's Law "builds up" its expense to the company over all will in not really all that long overwhelm the costs of the occasional breakage to stay up-to-date with the world.

    • smarks 15 hours ago

      (from Feb 2021)

      Some of the JDK's unmodifiable collections, such as those from `Set.of()` and `Map.of()`, also randomize iteration order. At the time they were added, Go and Python also randomized iteration order. However, more recently, Python moved away from randomization. Early in the Python 3.x releases, dict iteration order was randomized. In 3.6, iteration order was insertion order, but only as an implementation detail. In 3.7, this was elevated to a guarantee.

      https://docs.python.org/3/library/stdtypes.html#dict

      (search for "dictionary order")

      There are occasional complaints about Java's unmodifiable collections' randomized order, as it can occasionally cause intermittent test failures or even failures in production. However, the advantage of the randomized order is that we've been able to reorganize the internal implementations of these data structures -- twice -- without worrying about compatibility problems caused by changing the ordering. Historically this has been a problem with other structures such as `HashMap`. Even though HashMap's order isn't guaranteed, it's predictable and stable, and changing its ordering has definitely flushed out bugs.

      • o11c 14 hours ago

        IMO the real problem is that sometimes you really do want a deterministic order (not caring which one), but the container doesn't offer any API that provides it.

        And since hash-first languages provide a broken API, there's no way to provide it for arbitrary types. Compare-first languages (like C++) generally provide it, but paying the price of tree-based structures (assuming a B-tree can't negate it) isn't actually necessary.

        Rather than either of those choices, I argue that languages really should be "key-first" - rather than having classes implement hashing or comparison themselves, they should return a tuple of the data that will be fed into said hash or comparison. Besides being a nicer API, this also prevents many common bugs.

        • lesam 4 hours ago

          C++ gives both ‘map’ and ‘unordered_map’, and in my experience idiomatic C++ uses unordered_map unless you actually need a tree.

          • tialaramex 8 hours ago

            > IMO the real problem is that sometimes you really do want a deterministic order (not caring which one), but the container doesn't offer any API that provides it.

            This would be a lot more convincing if you had an actual concrete example where this was true, rather than just insisting that "sometimes" it's true.

            • alexhornby 7 hours ago

              Few examples I’ve come across: hashing, float summation, reproducible serialisation

              • medstrom 4 hours ago

                I get two of those, can you tell me more about float summation?

                • tialaramex 3 hours ago

                  Because the exponent varies, it matters what order we add floats together, if we add them in one order we get a very different answer to another. If we expected a consistent result that's very surprising.

                  Suppose you have a billion, and also you have a million ones, all as 32-bit floating point numbers. If we begin with the billion, and add each one, the additions do nothing, because in the representation used for the billion a single one is too small to matter, so our answer is just one billion again each time -- but if we begin with the ones, adding the billion last, we've reached a million when we add the billion, that's not too small to matter and we get 1.001 billion.

                  • skratky 3 hours ago

                    Use the Kahan summation algorithm.

                    • tialaramex an hour ago

                      Kahan solves my examples but in general cannot be relied on to deliver any specific benefit - it's never worse but it may not be meaningfully better either.

                      Pairwise algorithms can promise better results but aren't applicable unless we're OK with the idea of separately storing the numbers somewhere while we run the algorithm or we provide this hash table with a random access mechanism it may otherwise have no use for.

                • tialaramex 5 hours ago

                  Huh. I guess somebody could build a specialised container for this purpose, which deliberately has some arbitrary but unchanging ordering. Certainly that looks practical in Rust (indeed it may already exist, I haven't looked).

          • impoppy 2 hours ago

            Didn’t Python tackle a similar problem? I remember that dictionaries were known to be unsorted, however, when writing tests, I’ve noticed that the items were always in the same order. I don’t remember what was I looking up, but my manager and I came to a conclusion that depending on order items of a dictionary was acceptable and having to fix the tests if they somehow broke later was an okay tradeoff. Now, reading Python docs on that topic, they briefly[1] mention that dictionaries’ list views yield items in the order they were inserted without any mentions if items are sorted or not.

            [1] briefly in this regard means I’ve seen it in the docs exactly once without much attention paid to that sentence

            https://docs.python.org/3/tutorial/datastructures.html#dicti...

            • Tarean an hour ago

              This behaviour was introduced in 3.6 (and made part of the spec in 3.7 iirc)

              From the python 3.6 change log:

              New dict implementation¶ The dict type now uses a “compact” representation based on a proposal by Raymond Hettinger which was first implemented by PyPy. The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5.

              The order-preserving aspect of this new implementation is considered an implementation detail and should not be relied upon (this may change in the future, but it is desired to have this new dict implementation in the language for a few releases before changing the language spec to mandate order-preserving semantics for all current and future Python implementations; this also helps preserve backwards-compatibility with older versions of the language where random iteration order is still in effect, e.g. Python 3.5). (Contributed by INADA Naoki in bpo-27350. Idea originally suggested by Raymond Hettinger.)

              https://docs.python.org/3.6/whatsnew/3.6.html#new-dict-imple...

            • jbandela1 21 hours ago

              https://www.youtube.com/watch?v=ncHmEUmJZf4

              Is a fun presentation by Matthew Kulukundis (designer of Google's Hash Table) with Hyrum Wright offering objections from the crowd (Hyrum's law) about the design and rollout of it at Google

              • lsy 20 hours ago

                What's the advantage of specifying it as random over specifying it as sequential in some way? Aren't both just specifying a behavior, where one behavior is potentially more useful than the other? I guess I understand the principled point that a set of hash keys is not an array. But it seems like more complexity than is necessary, and even possible to fall victim to Hyrum's law besides… you can imagine someone using random hash ordering to implicitly randomize something without needing to call an RNG explicitly.

                It seems like the most principled approach might be an re-specification of the data structure: why is "iteration" even possible over a hash table? Shouldn't the keys be specified as a "set" on which only set-appropriate operations like map can be performed?

                • kstrauser 17 hours ago

                  One advantage to randomization, and why various languages did it in the first place, is that it prevents miscreants from DoSing your service if they know its implementation.

                  Say you're operating a service and you share its source on GitHub such that anyone can see it. Your language doesn't randomize hash values. Hashes are O(1), right? Well, not if a clever attacker can send you a set of values where they know each one will hash into the same bucket. The you end up with like 9 empty buckets and 1 with 100,000 items in it. Oops!

                  Old Python pre-randomization had a dict.items() method that would yield all the keys in order, conceptually kind of like:

                    for bucket in self._buckets:
                        for item in bucket:
                            yield item
                  
                  The order of those buckets would be repeatable between runs, so bucket foo would always come first, then bucket bar. Then Python added hash randomization so that the resulting hash key was something like hash(salt+key) instead of just hash(key). Now there's no way to tell in advance which bucket an item would get filed into, and the buckets would end up more or less balanced in size.

                  Newer Pythons (since 3.6? 3.7?) do something altogether different, and I can't explain exactly how their ordered dicts work, except to say I sat through a presentation on them and thought it was freaking genius even if I could re-implement it myself without sitting down with their docs.

                  • quelltext 16 hours ago

                    > and I can't explain exactly how their ordered dicts work

                    Traditionally you simply use a doubly linked list approach on the entries (each entry maintains two additional references to the previous and next entry) for that like LinkedHashMap: https://docs.oracle.com/javase//8/docs/api/java/util/LinkedH...

                    https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/...

                    Which is also what Python seems to be doing: https://stackoverflow.com/a/34496644

                    It's fairly intuitive.

                    Do their new default (now also ordered?) dics do this differently?

                    • kstrauser 15 hours ago

                      Note that OrderedDict is an implementation in Python. CPython's dict has a different implementation. There's more about it at https://docs.python.org/3.6/whatsnew/3.6.html#new-dict-imple... and https://mail.python.org/pipermail/python-dev/2012-December/1... .

                      • quelltext 13 hours ago

                        This implementation was used from 3.6, right?

                        It's interesting that the idea mail mentions that nothing changes about the implementation (including order) but the memory layout. Which would imply insertion order was already preserved in older versions (not the case IIRC) or the idea underwent a few more changes that did in fact impact order.

                        EDIT: I couldn't quite find an answer but https://softwaremaniacs.org/blog/2020/02/05/dicts-ordered/ mentions the behavior happens since then because the implementation only tracks indices in the hash table itself and relies on maintaining entries separately in a second array that gets expanded in insertion order.

                        This would also seem straightforward but it raises a few questions such as how deletion is implemented (efficiently).

                        EDIT2: Okay, the talk (https://youtu.be/p33CVV29OG8) mentions they just leave holes until the next resize (at around 42:00).

                        Raymond also mentions there that his original idea didn't preserve ordering but happened due to an additional compacting optimization? Should probably watch the whole thing some time to get the history. Sounds like a fun talk.

                        • kstrauser 11 hours ago

                          Oh! Yeah, that was the talk, but repeated at PyCon. It’s a very clever design that wasn’t at all obvious to me.

                    • o11c 12 hours ago

                      Note that this relies on all the object classes having correct hash implementation (which is actually quite difficult); it's not really a property of the container.

                      That said, the attack is almost always on builtin types, so ...

                    • prattmic 14 hours ago

                      In addition to the DoS aspect mentioned in a sibling comment, the primary reason you would do this is to avoid constraining the implementation. If you want to change the design to improve performance, for example, being forced to match the implicit ordering of the old implementation may be very difficult.

                      It certainly may be useful to define an specific ordering. Maps ordered by insertion time and maps ordered by key order are fairly common. But maps with these constraints probably won't be as fast as those with arbitrary ordering, so there is a trade-off decision to be made. A map constrained to the arbitrary ordering of the initial implementation is the worst of both worlds: difficult to make faster despite not having a particularly useful ordering.

                      As a concrete example, I am currently in the middle of rewriting Go's map implementation to use a more efficient design (https://go.dev/issue/54766). If Go hadn't randomized map iteration order this work would likely be impossible. It is unlikely we could completely change the design while keeping iteration order identical to the old implementation (assuming we want the new version to be faster at least).

                      • smarks 19 minutes ago

                        Cool! Yes, that’s been our experience in Java as well. The randomized order of unmodifiable Set and Map (from `Set.of` and `Map.of`) have enabled us to make optimizations to the internals mostly with impunity. We’ve done so twice since their introduction and we might do so again.

                      • chickenbig 4 hours ago

                        > why is "iteration" even possible over a hash table?

                        Going through all items in a container seems like a logical thing to want to do.

                        > Shouldn't the keys be specified as a "set" on which only set-appropriate operations like map can be performed?

                        Efficient implementation (in space and time) of these set operations would limit the possible efficient implementations of a hash table (i.e. to using the hash value as a key in an ordered map).

                        Going one step further, this guarantee would freeze behaviour of the hash function (and how they are combined) for all time; optimisations for different architectures would not be possible, nor would avoiding hash collision attacks by changing seeds.

                        Sometimes hash value stability is what you want (e.g. when transmitting data in space and time), so "frozen" hash functions give these guarantees e.g. https://github.com/google/highwayhash#versioning-and-stabili... .

                        Yes it can be annoying when tests are flaky, but the unit testing/matcher library should have some reasonably efficient way of matching unordered containers.

                        • CJefferson 5 hours ago

                          Unfortunately, map can (and does) have external effects in most languages, so your code could still be effected by the external order.

                          I’ve actually worked on some maths code where we wanted to prove algorithms worked correctly in exactly this type of situation, so we wanted to prove our code iterating over a hashed container was order-invariant. You can do it, but involves pulling in theorem provers, and is beyond what most people would probably tolerate.

                          • fwip 20 hours ago

                            I think removing methods from the Hashmap was not in scope for their work - they were working on the JDK, and presumably didn't have the bandwidth to patch every single Java library used at Google to use set-based methods instead of iteration-based.

                            Also, I think in practice you'll find you want to extract items from a set in some order, so you need some kind of transformation from set->list. e.g: you want to print them out.

                            Edit: Forgot to address your main question. If the specification requires a specific ordering, the implementation is forever bound to do that, even if other implementations would be more efficient/secure/other-desirable-properties. By introducing randomness, you reduce the risk of people accidentally relying on unspecified behavior, and are more able to change your implementation later.

                            • zanecodes 16 hours ago

                              One solution could be adding a `sort` argument (which takes a function that compares two `<T>` items and returns `true` or `false` depending on their order) to all functions on unordered collections of `<T>` items in which an order must be chosen, or requiring that the items in such collections implement a `TotalOrder` interface or something similar. This isn't very ergonomic in languages that don't have an equivalent of Traits or typeclasses though. In languages which permit side effects, this would include any functions that iterate over the items in an unordered collection.

                          • mseepgood 9 hours ago

                            Just don't make it uniformly distributed random, otherwise people will depend on it as a random number generator, what you don't want either.

                            • medstrom 4 hours ago

                              They might depend on it even if it's not uniformly distributed :)

                            • stygiansonic 13 hours ago

                              I wrote about something similar, which was motivated by an issue I saw caused by an (incorrect) expectation that a Java hashmap iteration order would be random: https://peterchng.com/blog/2022/06/17/what-iteration-order-c...

                              • OskarS 5 hours ago

                                I wanna say that Abseil's hash function does something like this? It was a couple of years ago, but I remember tearing my hair out figuring out why hash tables passed from a C++ api to my tests were failing. Turns out, the tests and the API was compiled by my build system as separate libraries and dynamically linked, and that abseil seeded their hash function with the address of a dummy global variable (which had a different location in the main program and the tests).

                                Presumably because they didn't want people to depend on the value of the hash function because of Hyrum's Law, but that meant the hash tables couldn't be passed across a DLL boundary, which seemed like an insane tradeoff to me. But hey, I'm not google, I get why they would do it.

                                • emmelaich 10 hours ago

                                  Shades of Rusty's API Design Manifesto.

                                  • hoistbypetard 19 hours ago
                                    • ljnelson 21 hours ago

                                      Interesting note: the JDK deliberately randomizes iteration order of certain immutable sets and maps (https://github.com/openjdk/jdk/blob/jdk-23%2B37/src/java.bas...).

                                      • sorokod 21 hours ago

                                        Hash iteration order is a great example of Hyrum’s Law

                                        An even better example (requires less words) https://xkcd.com/1172/

                                        • mise_en_place 20 hours ago

                                          Alternatively, maintain backward compatibility as much as possible. Function overloading might be too clever. Extra verbosity in exchange for clarity might be a worthwhile tradeoff. WET is acceptable in these scenarios.