• hairtuq 6 months ago

    Here is a cute variant that needs neither lzcnt nor tables, but only works up to 99,999:

        (((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17
    
    This is integer log 10, but it could be adapted to count digits. Covering 64-bit values needs a wrapper that invokes it multiple times, but most numbers in a JSON document are small, so it might even be competitive: it takes only 4 cycles given enough instruction-level parallelism.
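    Written out as a plain function (hypothetical name), the trick is that each magic constant is a mask of high bits minus a power of ten, so each addition carries into the high bits exactly when x crosses that power of ten:

    ```cpp
    #include <cassert>
    #include <cstdint>

    // hairtuq's branchless integer log10, valid for 0 <= x <= 99999.
    // Each constant is a mask of high bits minus a power of ten
    // (393206 = 0x60000 - 10, 524188 = 2^19 - 100, 916504 = 0xE0000 - 1000,
    // 514288 = 2^19 - 10000), so the carries flip distinct high bits at
    // each power-of-ten boundary, and bit 17 onward encodes the result.
    static uint32_t ilog10_small(uint32_t x) {
        return (((x + 393206) & (x + 524188)) ^
                ((x + 916504) & (x + 514288))) >> 17;
    }
    ```

    For a digit count, `ilog10_small(x) + 1` works across the whole supported range, since the function maps 0 to 0.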

    I gathered this idea from the output of a superoptimizer, it was fun to figure out how it works. For spoilers, see [1].

    [1] https://github.com/rust-lang/rust/blob/master/library/core/s...

    • dzaima 6 months ago

      A latency improvement would be to have digitCountThresholds be indexed by the lzcnt, instead of the result of the other LUT. Increases the size of that lookup table from 160B to 512B though. Funnily enough I've had this approach in a local copy of Ryu when I was working on cutting it down for my purposes. Unfortunately whatever ended up public had cut this out too.
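      Sketched out (hypothetical names, `std::countl_zero` standing in for `lzcnt`), the idea is to precompute `thresholdByLz[lz] = digitCountThresholds[digitCounts[lz]]` once, so the two loads issue in parallel instead of the threshold load waiting on the digit-count load:

      ```cpp
      #include <array>
      #include <bit>
      #include <cassert>
      #include <cstdint>

      // Minimum digit count for a nonzero value with the given lzcnt (0..63).
      static const uint8_t digitCounts[64] = {
          19,19,19,19,18,18,18,17,17,17,16,16,16,16,15,15,15,14,14,14,
          13,13,13,13,12,12,12,11,11,11,10,10,10,10, 9, 9, 9, 8, 8, 8,
           7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2,
           1, 1, 1, 1};

      // 512B table: the relevant 10^d - 1 threshold, indexed directly by lzcnt.
      static const std::array<uint64_t, 64> thresholdByLz = [] {
          uint64_t pow10m1[20];                       // pow10m1[d] = 10^d - 1
          pow10m1[0] = 0;
          for (int d = 1; d < 20; ++d) pow10m1[d] = pow10m1[d - 1] * 10 + 9;
          std::array<uint64_t, 64> t{};
          for (int lz = 0; lz < 64; ++lz) t[lz] = pow10m1[digitCounts[lz]];
          return t;
      }();

      static uint32_t fastDigitCount(uint64_t v) {
          int lz = std::countl_zero(v | 1);   // |1 keeps lz in 0..63, so 0 gets 1 digit
          return digitCounts[lz] + (v > thresholdByLz[lz]);
      }
      ```

      The trade-off is exactly as described: one dependent load becomes two independent ones, at the cost of widening the threshold table from 20 to 64 entries.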

      edit: some chat messages of me at [0]. Some previous unrelated discussion found at [1].

      [0]: https://app.element.io/#/room/#bqn:matrix.org/$KKIK86x0tygAf... (arbitrarily capped at 18 digits because Ryu didn't need more)

      [1]: https://github.com/ulfjack/ryu/issues/34

      • 0xcoffee 6 months ago

        C# Version:

            private static uint FastDigitCount(ulong value)
            {
                ReadOnlySpan<byte> digitCounts = [19, 19, 19, 19, 18, 18, 18, 17, 17, 17, 16, 16, 16, 16, 15, 15, 15, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1];
                ReadOnlySpan<ulong> digitCountThresholds = [0, 9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999, 9999999999, 99999999999, 999999999999, 9999999999999, 99999999999999, 999999999999999, 9999999999999999, 99999999999999999, 999999999999999999, 9999999999999999999];
            
                var leadingZeros = BitOperations.LeadingZeroCount(value);
                var originalDigitCount = digitCounts[leadingZeros];
                return originalDigitCount + (value > digitCountThresholds[originalDigitCount] ? 1u : 0u);
            }
        
        Benchmark: https://stackoverflow.com/a/79337820/4503491

        • realtimechris 6 months ago

          *Optimizing uint64_t Digit Counting: A New Method that Beats Lemire's by Up to 27%*

          In the quest to improve the performance of my high-speed JSON library, JSONIFIER, I recently stumbled upon a breakthrough in optimizing the calculation of digit counts for `uint64_t` values. While Lemire’s method of using the `lzcnt` instruction for determining the number of digits in a 32-bit unsigned integer has been widely regarded as efficient, I’ve developed a new method that achieves even faster results for 64-bit unsigned integers (i.e., `uint64_t`), with significant gains across different compilers and platforms.

          ### The Existing Method: Lemire’s Approach

          Lemire’s method, known for its efficiency, calculates the number of digits in a `uint32_t` by leveraging the `lzcnt` instruction, which counts the number of leading zero bits and thereby yields the index of the most significant set bit. This is combined with a static lookup table that maps that index to the corresponding number of digits.

          Here’s the code for Lemire’s method:

          ```cpp
          JSONIFIER_INLINE int int_log2(uint32_t x) {
              return 31 - simd_internal::lzcnt(x | 1);
          }

          JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
              static uint64_t table[] = { ... };
              return (x + table[int_log2(x)]) >> 32;
          }
          ```
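          The elided `table` isn't magic: each entry packs the maximum digit count for a power-of-two range into the upper 32 bits, offset so that the addition's carry performs the comparison against the next power of ten. A sketch that derives such a table at startup (hypothetical names; `std::countl_zero` in place of `simd_internal::lzcnt`; Lemire's published constants may differ in entries where several values work, but the invariant is the same):

          ```cpp
          #include <array>
          #include <bit>
          #include <cassert>
          #include <cstdint>

          static int int_log2(uint32_t x) { return 31 - std::countl_zero(x | 1u); }

          // table[l] = (d << 32) - 10^(d-1), where d is the max digit count of any
          // value in [2^l, 2^(l+1)): adding x pushes the top 32 bits up to d exactly
          // when x >= 10^(d-1), and leaves them at d - 1 otherwise.
          static const std::array<uint64_t, 32> table = [] {
              std::array<uint64_t, 32> t{};
              for (int l = 0; l < 32; ++l) {
                  uint64_t hi = (l == 31) ? 0xFFFFFFFFull : (1ull << (l + 1)) - 1;
                  uint64_t d = 1, pow_dm1 = 1;        // digits of hi, and 10^(d-1)
                  while (hi >= pow_dm1 * 10) { ++d; pow_dm1 *= 10; }
                  t[l] = (d << 32) - pow_dm1;
              }
              t[0] = 1ull << 32;                      // let x == 0 also report 1 digit
              return t;
          }();

          static int fast_digit_count(uint32_t x) {
              return (int)((x + table[int_log2(x)]) >> 32);
          }
          ```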

          While this approach works well for 32-bit integers, the need for a faster and more efficient solution for `uint64_t` led me to create an alternative method, which streamlines the lookups and replaces the shifted-sum trick with a direct threshold check.

          ### My New Method: RTC-64-Bit Digit Counting

          I’ve designed a new approach for 64-bit unsigned integers that leverages a similar logic but optimizes the process by storing precomputed digit counts for specific ranges and applying simple threshold checks. The result is faster execution with reduced computational overhead.

          Here's the code for the new method:

          ```cpp
          JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };

          JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };

          JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) {
              const uint64_t originalDigitCount{ digitCounts[simd_internal::lzcnt(inputValue)] };
              return originalDigitCount +
                     static_cast<uint64_t>(inputValue > digitCountThresholds[originalDigitCount]);
          }
          ```

          This method works by using a static array to hold the precomputed digit counts and another array of threshold values that determine the exact number of digits in a `uint64_t`. The key optimization is that a single `lzcnt` plus one table load yields a provisional digit count, and one compare against a threshold corrects it; no multiplication, division, or branching over ranges is needed.
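          Filling in the elided tables with the values from the C# port earlier in the thread gives a self-contained sketch of the same approach (with `std::countl_zero` standing in for `simd_internal::lzcnt`):

          ```cpp
          #include <bit>
          #include <cassert>
          #include <cstdint>

          // Minimum digit count per lzcnt value; index 64 covers inputValue == 0.
          static const uint8_t digitCounts[65] = {
              19,19,19,19,18,18,18,17,17,17,16,16,16,16,15,15,15,14,14,14,
              13,13,13,13,12,12,12,11,11,11,10,10,10,10, 9, 9, 9, 8, 8, 8,
               7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2,
               1, 1, 1, 1, 1};

          // digitCountThresholds[d] = 10^d - 1: a value with a provisional count
          // of d digits actually has d + 1 digits iff it exceeds this threshold.
          static const uint64_t digitCountThresholds[20] = {
              0ull, 9ull, 99ull, 999ull, 9999ull, 99999ull, 999999ull, 9999999ull,
              99999999ull, 999999999ull, 9999999999ull, 99999999999ull,
              999999999999ull, 9999999999999ull, 99999999999999ull,
              999999999999999ull, 9999999999999999ull, 99999999999999999ull,
              999999999999999999ull, 9999999999999999999ull};

          static uint64_t fastDigitCount(uint64_t inputValue) {
              const uint64_t originalDigitCount = digitCounts[std::countl_zero(inputValue)];
              return originalDigitCount +
                     (inputValue > digitCountThresholds[originalDigitCount]);
          }
          ```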

          ### [Benchmark Results](https://github.com/RealTimeChris/BenchmarkSuite/blob/digit-c...)

          I ran performance benchmarks comparing my new RTC-64-bit method with Lemire’s approach and the traditional `log10` method across various platforms and compilers. The results were consistently impressive:

          #### GCC/Ubuntu:

          - *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*.
          - *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.

          #### Clang/Ubuntu:

          - *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*.
          - *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.

          #### MSVC/Windows:

          - *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*.
          - *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.

          #### Clang/MacOS:

          - *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*.
          - *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.

          ### Key Takeaways

          The RTC-64-bit method not only delivers improved performance on modern hardware but also significantly reduces the overhead of traditional methods by eliminating the need for a large lookup table. This is especially beneficial for high-performance applications where every cycle counts, such as in JSON serialization and parsing, where speed is critical.