• yodon 13 hours ago

    This looks super valuable!

    That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.

    Hopefully OpenAI isn't that biased at generating die rolls, so is that number actually giving us information about the accuracy of the probability assessments?

    • dragonwriter 10 hours ago

      > That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.

      Finding that an LLM is biased toward inventing die rolls that are the median result rounded to an available result by the most common rounding method is...not particularly surprising. If you want a fair RNG, use an RNG designed to be fair, not an LLM where that would be, at best, an emergent accidental property.

      • teej 7 hours ago

        Fair dice rolls are not an objective that cloud LLMs are optimized for. You should assume that LLMs cannot perform this task.

        This is a problem when people naively use "give an answer on a scale of 1-10" in their prompts. LLMs are biased towards particular numbers (like humans!) and cannot linearly map an answer to a scale.

        It's extremely concerning when teams do this in a context like medicine. Asking an LLM "how severe is this condition" on a numeric scale is fraudulent and dangerous.

        • low_tech_love 23 minutes ago

          This week I was in a meeting for a rather important scientific project at the university, and I asked the other participants “can we somehow reliably cluster this data to try to detect groups of similar outcomes?” to which a colleague promptly responded “oh yeah, ChatGPT can do that easily”.

          • Terr_ 5 hours ago

            It'll also give you different results based on logically-irrelevant numbers that might appear elsewhere in the collaborative fiction document.

          • ngrislain 13 hours ago

            Thank you! The number is the sum of the logprobs of the tokens constituting each individual value, converted to a probability. So it does represent the likelihood of seeing this value. So yes, OpenAI is super-biased as a random number generator. We sampled other values from OpenAI and got other die roll values, but with much lower probabilities (5 has an 8% chance).

            • ngrislain 13 hours ago

              More precisely it represents the likelihood of seeing this value conditional on the tokens before it.
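
              To make the arithmetic concrete, here is a minimal sketch (not the library's code) of how per-token conditional logprobs combine into one probability for a value. The function name and the two-token example are made up for illustration.

```python
import math

def value_probability(token_logprobs):
    """Joint probability of a token sequence from per-token conditional logprobs."""
    # log p(t1..tn) = sum_i log p(t_i | t_1..t_{i-1}), so summing the
    # conditional logprobs and exponentiating gives the joint probability.
    return math.exp(sum(token_logprobs))

# Hypothetical value spread over two tokens with conditional
# probabilities 0.8 and 0.5 (numbers invented for this example).
logprobs = [math.log(0.8), math.log(0.5)]
prob = value_probability(logprobs)
print(f"{prob:.2f}")  # 0.8 * 0.5
```

              The result is still conditional on everything that came before the value in the output, as noted above.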

              • radarsat1 7 hours ago

                And I guess it includes other possibilities than numbers, like 'f', which could lead to four or five. There's probably a separate probability for 'fi' and 'fo' too.

            • mmcwilliams 10 hours ago

              What about the models they offer would make you think that it wouldn't be biased at generating random die rolls?

              • supernewton 10 hours ago

                I feel like https://xkcd.com/221/ might be heavily influencing what the typical "random" die roll looks like on the internet ;)

                • prerok 9 hours ago

                  Based on this comic, I've seen unit tests use 4 as a replacement for a randomly generated number to ensure non-flakiness (of course, only when needed). But it might explain the LLM's bias?

                  • ngrislain 9 hours ago

                    Haha, I didn't know that one! It's consistent with OpenAI's conception of a "random" dice roll :-D. Joking apart, I'm quite convinced many people would not find 1 or 6 to look "random" enough to be chosen as an example dice roll.

                • HanClinto 9 hours ago

                  This is really brilliant stuff! Somehow I didn't realize that logprobs were being returned as part of the OAI requests, and I really like this application of it.

                  Any interest in seeing this sort of thing being added to llama.cpp?

                  • HanClinto 9 hours ago

                    Looking at llama.cpp, it already supports the logprob field in its OAI API emulation, so it shouldn't be too difficult to use this library with it.

                    It feels like this would be useful enough to build around -- I especially like the idea of asking the API to return the top K results for each field, and denoting their likelihood -- almost like a dropdown box with percentages attached for each possible result.
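
                    A rough sketch of that "dropdown" idea, assuming you already have a mapping from candidate values to logprobs for one field (for example, collected from the top alternatives the API returns). The helper name is hypothetical; the 65% and 8% figures echo the die-roll numbers in this thread and the 12% is invented.

```python
import math

def dropdown(candidate_logprobs, k=3):
    """Format the k most likely candidates as 'value: NN%' strings."""
    # Sort by logprob, highest first, and keep the top k entries.
    ranked = sorted(candidate_logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # exp() turns each logprob back into a probability for display.
    return [f"{value}: {math.exp(lp):.0%}" for value, lp in ranked]

# Candidate die-roll values with illustrative probabilities.
options = {"4": math.log(0.65), "5": math.log(0.08), "3": math.log(0.12)}
for line in dropdown(options):
    print(line)
```

                    The percentages will not sum to 100% unless every possible candidate is included.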

                • juxtaposicion 10 hours ago

                  This looks great; very useful for (e.g.) ranking outputs by confidence so you can do human reviews of the not-confident ones.
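
                  Something like this is what I have in mind, as a sketch only: treat the summed logprob of each output as a rough score and queue the low-scoring ones for review. The function name, the item names, and the 0.9 threshold are all hypothetical.

```python
import math

def review_queue(items, threshold=0.9):
    """Return names of items whose probability is below threshold, least likely first."""
    # Each item is (name, summed_logprob); convert the logprob to a probability.
    scored = [(name, math.exp(logprob)) for name, logprob in items]
    low = [(name, p) for name, p in scored if p < threshold]
    # Lowest probability first, so reviewers see the shakiest outputs early.
    return [name for name, p in sorted(low, key=lambda pair: pair[1])]

# Invented scores for three extracted documents.
items = [("doc_a", math.log(0.99)), ("doc_b", math.log(0.55)), ("doc_c", math.log(0.80))]
print(review_queue(items))
```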

                  Any chance we can get Pydantic support?

                  • themanmaran 6 minutes ago

                    Fyi logprobs !== confidence.

                    If you run "bananas,fishbowl,phonebook," and get {"sponge": 0.76}

                    It doesn't mean that "sponge" was the 76% correct answer. Just that the word "sponge" was the next most likely word for the model to generate.

                    • ngrislain 8 hours ago

                      Actually, OpenAI provides Pydantic support for structured output (see client.beta.chat.completions.parse in https://platform.openai.com/docs/guides/structured-outputs).

                      The library is compatible with that but does not use Pydantic further than that.

                      • juxtaposicion 8 hours ago

                        Right, the hope was to go further. E.g. if the input is:

                        ```
                        from typing import Literal

                        from pydantic import BaseModel

                        class Classification(BaseModel):
                            color: Literal['red', 'blue', 'green']
                        ```

                        then the output type would be:

                        ```
                        from typing import Dict

                        class ClassificationWithLogProbs(BaseModel):
                            color: Dict[Literal['red', 'blue', 'green'], float]
                        ```

                        Don't take this too literally; I'm not convinced that this is the right way to do it. But it would provide structure and scores without dealing with a mess of complex JSON.

                    • kelsolaar 6 hours ago

                      I briefly took a look at the code. What is the reason to use Lark rather than Python's native JSON parser? Is it to handle cases where the structured output is not JSON-compatible?

                      • potatoman22 8 hours ago

                        How does the token usage compare to vanilla structured output? Many of these libraries do multiple requests to constrain output and measure logprobs.

                        • ngrislain 8 hours ago

                          Same token usage. OpenAI returns the logprob of each token, conditional on the previous ones, when the option logprobs=true is set. This lib simply parses the output JSON string with `lark` into an AST with value nodes. The value nodes are mapped back to a range of characters in the JSON string. Then the characters are mapped back to the GPT tokens overlapping the character ranges, and the logprobs of those tokens are summed.
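
                          The last step (character range to overlapping tokens to summed logprob) can be sketched roughly as follows. This is an illustration, not the library's actual code: the token list is a hand-written stand-in for the API response, the logprobs are made up, and the value's character range is found with `str.index` instead of a `lark` AST.

```python
import math

def value_logprob(tokens, char_start, char_end):
    """Sum the logprobs of tokens overlapping the half-open range [char_start, char_end)."""
    total = 0.0
    pos = 0  # character offset of the current token in the concatenated text
    for text, logprob in tokens:
        tok_start, tok_end = pos, pos + len(text)
        # Two half-open intervals overlap when each starts before the other ends.
        if tok_start < char_end and tok_end > char_start:
            total += logprob
        pos = tok_end
    return total

# Hypothetical tokenization of '{"roll": 4}' as (token_text, logprob) pairs.
tokens = [('{"', -0.01), ('roll', -0.02), ('":', -0.01), (' ', -0.05), ('4', -0.43), ('}', -0.01)]
json_text = "".join(text for text, _ in tokens)
start = json_text.index("4")          # character range of the value node
lp = value_logprob(tokens, start, start + 1)
print(round(math.exp(lp), 2))         # probability of the value "4"
```

                          Since everything is computed from the single response, no extra requests are needed.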

                          • potatoman22 7 hours ago

                            That's great to hear, thanks for the explanation! Super excited to try this out.

                        • Der_Einzige 9 hours ago

                          BTW - Structured/Constrained Generation is the KEY to making AI agents better/scary good. Without it, you're leaving so much on the table. This library is awesome for augmenting that capability!!!!

                          Also, if you're "studying LLM based chess" and you don't use dynamic grammars to enforce that models can only make "valid" moves at each time step, your research is basically invalid.
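
                          To illustrate the idea (a toy sketch, not how any particular library does it): keep only the candidates that are legal and renormalise their probabilities. Real grammar-constrained decoding masks tokens inside the sampler at every step; here whole moves are treated as candidates, and the function name, moves, and probabilities are invented.

```python
import math

def constrain(candidate_logprobs, allowed):
    """Drop candidates outside `allowed` and renormalise the remaining probabilities."""
    kept = {c: lp for c, lp in candidate_logprobs.items() if c in allowed}
    if not kept:
        raise ValueError("no allowed candidate among the model's options")
    # Log-sum-exp of the kept logprobs, computed stably by subtracting the max.
    top = max(kept.values())
    log_z = top + math.log(sum(math.exp(lp - top) for lp in kept.values()))
    return {c: math.exp(lp - log_z) for c, lp in kept.items()}

# The model puts 30% on an illegal move ("Ke9"); the legal set excludes it.
model_output = {"e4": math.log(0.5), "Ke9": math.log(0.3), "d4": math.log(0.2)}
legal_moves = {"e4", "d4", "Nf3"}
print(constrain(model_output, legal_moves))
```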

                          And don't meme me with claims that structured/constrained generation harms creativity. The devs of outlines debunked that FUD already: https://blog.dottxt.co/say-what-you-mean.html

                          Similarly, if you think that RLHF/DPO or LoRA or any of that harms creativity, you're really outing yourself as not having played with high temperature sampling.

                          • ngrislain 9 hours ago

                            Thank you! Yes indeed, structured output was instrumental in reliably extracting structured data from images from a client.