• _nalply 5 hours ago

    Also have a look at concurrent_map. It is a concurrent BTreeMap, meaning it maintains key order (not insertion order; the keys implement Ord).

    https://lib.rs/crates/concurrent-map

    I am using it for a transactional in-memory key-value store experiment.

    • ibraheemdev 3 hours ago

      Looks very interesting, but seems to serve a pretty different use case:

      > This is an ordered data structure, and supports very high throughput iteration over lexicographically sorted ranges of values. If you are looking for simple point operation performance, you may find a better option among one of the many concurrent hashmap implementations that are floating around. Pay for what you actually use :)

      • CyberDildonics 2 hours ago

        If you just need a concurrent key-value data structure, a hash map should be much faster and scale better than something that maintains sorted order.

      • eterm 6 hours ago

        I'd love to see this put through its paces with the following:

        I have a strategy for benchmarking such data structures; I call it the "birthday benchmark".

        Unfortunately I'm not proficient enough in Rust to have put it into practice there, but the strategy is this:

        Generate a stream of random bytes; pre-generating it is advised so that generation doesn't dominate the benchmark.

        Consume the stream in blocks of n bytes.

        Try to find a duplicated block.

        Given the "birthday paradox / birthday attack", this happens much sooner than it might first appear. A simple single-threaded hashset can find a duplicate among 6-byte blocks in around 3 seconds on my hardware, and a specialised data structure brings that to under a second.

        A good concurrent hashtable should improve that even further, because this ought to be a very parallelizable problem, especially if you're not constrained to finding the first duplicate in the stream but allow yourself to find any duplicate, don't need to keep track of both sides of the pair, and are content simply knowing you have found one.
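        A minimal single-threaded sketch of the idea in Rust (hypothetical code, not from any of the crates discussed; it assumes the random stream is already pre-generated):

        ```rust
        use std::collections::HashSet;

        /// Scan a pre-generated byte stream in fixed-size blocks and return
        /// the index of the first block whose value was seen before.
        fn first_duplicate(stream: &[u8], block_len: usize) -> Option<usize> {
            let mut seen = HashSet::new();
            for (i, block) in stream.chunks_exact(block_len).enumerate() {
                // `insert` returns false if the block was already present
                if !seen.insert(block) {
                    return Some(i);
                }
            }
            None
        }
        ```

        By the birthday bound, with b-byte blocks a duplicate is expected after roughly 2^(4b) blocks, which is why 6-byte duplicates turn up within seconds.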

        • vlmutolo 5 hours ago

          > Generate a stream of random bytes; pre-generating it is advised so that generation doesn't dominate the benchmark.

          Current PRNGs are pretty fast. The Xoroshiro RNG "shootout" benchmark [0] lists some self-reported speeds. They claim 8+ GB/s even for their slowest, highest-quality PRNG. The general-purpose one they recommend does 10 GB/s, and 32 GB/s when vectorized.

          The vectorized versions get close to current DRAM speeds. I think I'd prefer that over reading from a giant table, given that the table reads will have significant implications for caching and disrupt that aspect of the benchmark.

          [0]: https://prng.di.unimi.it/#shootout
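          For reference, the generator that page recommends for general use (xoshiro256++) is only a handful of lines; here is a Rust transcription of the public-domain reference implementation (a sketch: seeding is left to the caller, and the state must not be all zeros):

          ```rust
          /// xoshiro256++, transcribed from the public-domain C reference.
          /// The state must be seeded to something other than all zeros.
          struct Xoshiro256pp {
              s: [u64; 4],
          }

          impl Xoshiro256pp {
              fn new(seed: [u64; 4]) -> Self {
                  Xoshiro256pp { s: seed }
              }

              fn next_u64(&mut self) -> u64 {
                  // result = rotl(s0 + s3, 23) + s0
                  let result = self.s[0]
                      .wrapping_add(self.s[3])
                      .rotate_left(23)
                      .wrapping_add(self.s[0]);
                  let t = self.s[1] << 17;
                  self.s[2] ^= self.s[0];
                  self.s[3] ^= self.s[1];
                  self.s[1] ^= self.s[2];
                  self.s[0] ^= self.s[3];
                  self.s[2] ^= t;
                  self.s[3] = self.s[3].rotate_left(45);
                  result
              }
          }
          ```

          Each call is a few shifts and xors with no memory traffic at all, which is where the table-versus-generate trade-off comes from.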

          • eterm 4 hours ago

            Thanks for that suggestion, I shall explore faster PRNGs and vectorization too then.

            My domain is C#, where it's perhaps unsurprising that it's a lot faster to have an array in memory than to go via the default Random, which I believe is Xoroshiro (at least in .NET 6+). It certainly can generate data quickly with Random.NextBytes(), but repeated calls to Random.NextInt64() are much slower.

            Another issue I found when generating random numbers across multiple threads is thread safety / synchronisation. If each thread does its own random generation, it's difficult to compare results between 1 thread and 2 threads, because it's no longer the same single stream of random data measured across each benchmark, so the meta-randomness of how early you "should" hit a dupe becomes a factor.

            Having the numbers pre-generated, with a single source responsible for handing them out, makes it easier: you can either use a thread-safe counter, or have each thread step through the data at a fixed stride with a per-thread offset. The first "real" dupe is then in the same place in the pre-generated data for every run.
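            A sketch of that stride-plus-offset scheme in Rust (hypothetical; a mutex-guarded set stands in for a real concurrent hashtable, so this shows the partitioning, not the speed-up):

            ```rust
            use std::collections::HashSet;
            use std::sync::Mutex;
            use std::thread;

            /// Each of `n_threads` scans blocks at indices t, t + n_threads,
            /// t + 2*n_threads, ... over the same pre-generated stream,
            /// sharing one set. Returns the lowest duplicate index found.
            fn first_duplicate_striped(
                stream: &[u8],
                block_len: usize,
                n_threads: usize,
            ) -> Option<usize> {
                let seen = Mutex::new(HashSet::new());
                let hits = Mutex::new(Vec::new());
                thread::scope(|s| {
                    for t in 0..n_threads {
                        let seen = &seen;
                        let hits = &hits;
                        s.spawn(move || {
                            let mut i = t;
                            while (i + 1) * block_len <= stream.len() {
                                let block = &stream[i * block_len..(i + 1) * block_len];
                                // insert under the lock, so no duplicate is missed
                                if !seen.lock().unwrap().insert(block.to_vec()) {
                                    hits.lock().unwrap().push(i);
                                    return;
                                }
                                i += n_threads;
                            }
                        });
                    }
                });
                hits.into_inner().unwrap().into_iter().min()
            }
            ```

            Because insertion order across threads differs from stream order, the reported index may be either side of the colliding pair, which matches the "any duplicate is fine" relaxation above.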

            You can then compare the incremental gains. For example, giving each thread its own private hashtable gains you only First Dupe = Min(First Dupe across N threads), whereas an actual thread-safe concurrent hashtable should[1] still find that first dupe, sped up by a greater factor than naive work-sharing without state-sharing.

            I recognise there are potential memory caching issues at play with the pregeneration approach, but for larger bit counts the actual work should hopefully dominate, particularly since look-up values aren't revisited by design.

            [1] "Should", because there's a small chance in many algorithms that due to concurrency and race conditions you actually miss the duplicate. Either multiple runs should be tried looking for bi-modal data, or accept that the next dupe shouldn't be so long after the first, and be happy the speed-up is greater than the slow-down caused by very occassional race condition misses. The chance of a such a race condition is absolutely minuscule at the 48bit level, if we assume we are using say, 8 threads, and assume for a race condition to occur, the same 48-number would have to be concurrently being handled/generated by 2 different threads.

            • mananaysiempre an hour ago

              In my experience, vectorization is very simple: just running multiple instances of Xoroshiro (or one of the weaker variants) inside a vector works quite well. Fitting the resulting peg into the hole of a preexisting API is difficult, but that's always a problem.
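              A sketch of that layout in plain Rust with no explicit SIMD: four xoshiro256+ lanes stored as arrays-of-lanes and stepped in lockstep, so the update loop is a candidate for autovectorization. Seeding goes through splitmix64, as the xoshiro authors suggest; the struct and method names are made up here.

              ```rust
              /// splitmix64, used only to expand per-lane seeds into full state.
              fn splitmix64(x: &mut u64) -> u64 {
                  *x = x.wrapping_add(0x9E3779B97F4A7C15);
                  let mut z = *x;
                  z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
                  z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
                  z ^ (z >> 31)
              }

              /// Four independent xoshiro256+ streams stepped in lockstep.
              /// State is stored as s[word][lane] so each update touches
              /// contiguous lanes, which the compiler can vectorize.
              struct Xoshiro256PlusX4 {
                  s: [[u64; 4]; 4],
              }

              impl Xoshiro256PlusX4 {
                  fn new(lane_seeds: [u64; 4]) -> Self {
                      let mut s = [[0u64; 4]; 4];
                      for lane in 0..4 {
                          let mut x = lane_seeds[lane];
                          for word in 0..4 {
                              s[word][lane] = splitmix64(&mut x);
                          }
                      }
                      Xoshiro256PlusX4 { s }
                  }

                  /// One output per lane per call (xoshiro256+: result = s0 + s3).
                  fn next4(&mut self) -> [u64; 4] {
                      let mut out = [0u64; 4];
                      for lane in 0..4 {
                          out[lane] = self.s[0][lane].wrapping_add(self.s[3][lane]);
                          let t = self.s[1][lane] << 17;
                          self.s[2][lane] ^= self.s[0][lane];
                          self.s[3][lane] ^= self.s[1][lane];
                          self.s[1][lane] ^= self.s[2][lane];
                          self.s[0][lane] ^= self.s[3][lane];
                          self.s[2][lane] ^= t;
                          self.s[3][lane] = self.s[3][lane].rotate_left(45);
                      }
                      out
                  }
              }
              ```

              The awkward part, as noted, is the API: callers want one number at a time, and this hands back four.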

              • Iwan-Zotow 2 hours ago

                > which is I believe is Xoroshiro ( at least in .net 6+ )

                  I believe it's Xoshiro in .NET 6, a close cousin.

                https://blogs.siliconorchid.com/post/coding-inspiration/rand...

                • eterm 2 hours ago

                  Thanks for that correction, I hadn't appreciated the subtle difference there.

            • bloppe 4 hours ago

              This is a write-heavy workload. Papaya is optimized for read-heavy workloads.

              It's an interesting benchmark, but the kind of people who would want to use Papaya probably wouldn't be very interested in it.

              • winwang 3 hours ago

                Do you have other examples of interesting workloads? Benchmarking is difficult, lol.

              • winwang 3 hours ago

                Are there "batch"-esque workloads where we want extreme throughput, but can tolerate a large latency (10-100 micros)?

                • rurban 8 hours ago

                  This looks pretty good in terms of its tradeoffs and the tricks used. It makes good use of metadata, as in https://greg7mdp.github.io/parallel-hashmap/
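                  The metadata idea can be shown with a toy (sequential, not concurrent) open-addressing map: a parallel array of one-byte tags rejects almost every mismatching probe with a single byte compare before the full entry is touched. This is a hypothetical illustration, not parallel-hashmap's or papaya's actual code:

                  ```rust
                  /// Toy open-addressing map with SwissTable-style metadata bytes.
                  /// meta[i] == 0 means slot i is empty; otherwise it holds the top
                  /// 7 bits of the hash with the high bit set. (Toy: no deletion,
                  /// and it assumes the table never fills completely.)
                  struct MetaMap {
                      meta: Vec<u8>,
                      slots: Vec<(u64, u64)>, // (key, value), valid where meta != 0
                  }

                  impl MetaMap {
                      fn new(capacity: usize) -> Self {
                          MetaMap { meta: vec![0; capacity], slots: vec![(0, 0); capacity] }
                      }

                      fn hash(key: u64) -> u64 {
                          // cheap multiplicative mix; a real table uses a stronger hash
                          key.wrapping_mul(0x9E3779B97F4A7C15)
                      }

                      fn tag(h: u64) -> u8 {
                          ((h >> 57) as u8) | 0x80 // top 7 bits; high bit marks occupied
                      }

                      fn insert(&mut self, key: u64, value: u64) {
                          let h = Self::hash(key);
                          let tag = Self::tag(h);
                          let mut i = (h as usize) % self.meta.len();
                          loop {
                              if self.meta[i] == 0
                                  || (self.meta[i] == tag && self.slots[i].0 == key)
                              {
                                  self.meta[i] = tag;
                                  self.slots[i] = (key, value);
                                  return;
                              }
                              i = (i + 1) % self.meta.len(); // linear probe
                          }
                      }

                      fn get(&self, key: u64) -> Option<u64> {
                          let h = Self::hash(key);
                          let tag = Self::tag(h);
                          let mut i = (h as usize) % self.meta.len();
                          loop {
                              if self.meta[i] == 0 {
                                  return None;
                              }
                              // the one-byte compare filters almost all mismatches,
                              // so the full slot is rarely read on a miss
                              if self.meta[i] == tag && self.slots[i].0 == key {
                                  return Some(self.slots[i].1);
                              }
                              i = (i + 1) % self.meta.len();
                          }
                      }
                  }
                  ```

                  The real implementations compare 16 metadata bytes at once with SIMD, which is where most of the probe-speed win comes from.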

                  • gleenn an hour ago

                    "In some ways, they are the holy grail of concurrent data structures. On the other hand, a concurrent hash table is an inelegant blob of shared mutable data, often a marker of a poorly architectured program."

                    I find the last bit particularly objectionable. If you're in some language slinging objects around all day, then sticking an untyped, bespoke substitute for an object in the middle probably isn't the right move. But over the long path of my programming career, I've come to recognize that objects make many things harder and more confusing than necessary. I definitely subscribe to Rich Hickey's view that 1 data structure with 100 functions operating on it is far easier to work with and understand than 10 data structures with 10 functions each. Clojure is an absolute joy to work with and reason about, and it only gets better the faster you get out of typed-object land. Hashmaps are the purest abstraction over an associative data structure, and I'll take one over a pile of classes with brittle, snowflake interfaces any day.