Comments Page - Creating a LLM-as-a-Judge That Drives Business Results

Creating a LLM-as-a-Judge That Drives Business Results | hamel.dev | Submitted by thenameless7741 6 days ago

Lerc 6 days ago
There are a few broad areas of risk in AI.
1. Enabling goes both ways, therefore bad actors can also be enabled by AI.
2. Accuracy of suggestions. Information provided by AI may be incorrect, be it code, how to brush one's teeth, or the height of Arnold Schwarzenegger. At worst, AI can respond against the user's interests if the creator of the AI has configured it to do so.
3. Accuracy of determinations. LLM-as-a-Judge falls under this category. This is one of the areas where a single error can magnify the most.
This post says:
> What about guardrails? Guardrails are a separate but related topic. They are a way to prevent the LLM from saying/doing something harmful or inappropriate. This blog post focuses on helping you create a judge that’s aligned with business goals, especially when starting out.
That seems woefully inadequate.
When using AI to make determinations, there have to be guardrails. Having looked at drafts of legislation and position statements of governments, many are looking at legally requiring that any implementers of AI systems that make determinations must implement processes to deal with the situation where the AI makes an incorrect determination. To be effective, this should be a process that can be initiated by the individuals affected by the determination.
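A minimal sketch of what such a challenge process could look like in code, purely as an illustration. The `Determination` record, its status values, and the `appeal`/`resolve` methods are all hypothetical names, not taken from the post or from any legislation. The idea is that an automated determination stays contestable, and that an appeal started by the affected person always ends with a human reviewer.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    AUTOMATED = "automated"        # decided by the AI judge, not yet challenged
    UNDER_REVIEW = "under_review"  # the affected person has appealed
    UPHELD = "upheld"              # a human reviewer agreed with the AI
    OVERTURNED = "overturned"      # a human reviewer reversed the AI


@dataclass
class Determination:
    subject_id: str   # the individual affected by the decision
    outcome: str      # what the AI decided, e.g. "deny"
    rationale: str    # the AI's stated reason, kept for the reviewer
    status: Status = Status.AUTOMATED
    history: list = field(default_factory=list)

    def appeal(self, appellant_id: str, reason: str) -> None:
        # Assumption: only the person affected may start the process.
        if appellant_id != self.subject_id:
            raise PermissionError("only the affected individual can appeal")
        if self.status is not Status.AUTOMATED:
            raise ValueError("determination has already been appealed")
        self.status = Status.UNDER_REVIEW
        self.history.append(("appeal", appellant_id, reason))

    def resolve(self, reviewer_id: str, corrected_outcome: Optional[str] = None) -> None:
        # A human reviewer closes the appeal, optionally correcting the outcome.
        if self.status is not Status.UNDER_REVIEW:
            raise ValueError("nothing to resolve: no appeal is open")
        if corrected_outcome is None or corrected_outcome == self.outcome:
            self.status = Status.UPHELD
        else:
            self.outcome = corrected_outcome
            self.status = Status.OVERTURNED
        self.history.append(("resolve", reviewer_id, self.outcome))


d = Determination(subject_id="user-1", outcome="deny", rationale="low score")
d.appeal("user-1", "the score was based on someone else's record")
d.resolve("reviewer-9", corrected_outcome="approve")
print(d.status.value, d.outcome)
```

The `history` list is there so that both the appeal and the human decision leave an audit trail; a real system would need much more than this (deadlines, notification, escalation).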
- salawat 5 days ago
  >1. Enabling goes both ways, therefore bad actors can also be enabled by AI.
  >2. Accuracy of suggestions. Information provided by AI may be incorrect, be it code, how to brush one's teeth, or the height of Arnold Schwarzenegger. At worst, AI can respond against the user's interests if the creator of the AI has configured it to do so.
  >3. Accuracy of determinations. LLM-as-a-Judge falls under this category. This is one of the areas where a single error can magnify the most.
  sed s_AI_technology_g < quoted_text
  Lerc 4 days ago
  That only really applies to 1.
  Technology in general does not have the ability to advise. Advice is already a legally protected category in many fields. Legal, medical, and financial advice are each covered by their own legislation to require the advisers to act in the interest of their client.
  Technology in general has been used for years to make determinations based upon physical characteristics. Weighs more than 50g, Temperature over 20C, Item A is heavier than Item B. The distinction with AI is that it will be making semantic determinations.
  Computers have also been used to implement policy decisions in a manner that makes them appear to be making determinations. When someone says "Computer says no", it's usually a human who made a decision. Attributing it to the computer just makes it impossible to argue against.
  The world would probably benefit from there being regulation on processes to enable people to challenge incorrect determination in this area as well. We have all seen the news stories of a big tech company cutting a user from a vital service because of an arbitrary seeming decision. Often the only challenge mechanism that is effective is when the situation receives widespread publicity. While the decision causing the issue is usually human in these instances, it is shielded behind a wall of technology leaving the user with no recourse.
- nine_zeros 6 days ago
  > Having looked at drafts of legislation and position statements of governments, many are looking at legally requiring that any implementers of AI systems that make determinations must implement processes to deal with the situation where the AI makes an incorrect determination
  The real legislation we need is liability. Who is liable for suffering caused by an LLM's inaccuracies?
  I think liability should be on corporations selling the LLM as a solution.
  If a person gets arrested because a vendor sold the police a fuzzy LLM solution, and this causes unnecessary grief to the individual, the seller of the LLM service must compensate the individual with 4x the median income of their metropolitan area, for the duration of the harm caused.
  phs318u 6 days ago
  > liability should be on corporations selling the LLM as a solution
  I agree. There was a period of time (don't know if it's still the case), where Microsoft Windows licenses/ToS expressly and explicitly did not warrant the use of the OS for a whole bunch of critical use-cases (e.g. anything with the word 'nuclear' in it). If the vendor doesn't warrant it for that use, then you'd be remiss in your duty to your company to ignore that and choose to use it for that purpose. Caveat emptor.
jerpint 6 days ago
The biggest problem these days is that it’s very easy to hack together a solution for a problem that, at first glance, seems to work just fine. Understanding the limits of the system is the hard part, especially since LLMs can’t know when they don’t know.
- trod123 6 days ago
  I second this, though it's a bit unclear to any non-domain expert in systems or systems organization.
  Defining the problem and identifying constraints is always the hardest part, and it's always different for each application. You also never know what you don't know when a project starts.
  The process is inevitably a constant feedback system of discovery, testing, and discarding irrational or irrelevant results until you get to the first principles or requirements needed.
  Computers as a general rule can't do this, as the lowest part of the von Neumann architecture can't tell truth from falsity when the inputs are the same (i.e. determinism as a property is broken). You have automation break in similar ways.
  Approximations, which is what the encoded weights are, are just that: approximations, without a thought process. While you can make a very convincing simulacrum, you'll never get a true expert, and the process is not a net benefit overall since you end up creating cascading problems later that cannot be solved.
  Put another way, when there is no economic incentive to become a true expert in the first place, and this is only done through working the problems, the knowledge is not passed on, and then lost when the experts age and die.
  Since you at best may only be able to target what amounts to entry-level roles, and these roles are what people use to become experts, any adoption replacing these workers guarantees this ultimately destructive outcome with any haphazard attempt. Even if you can't meet that level of production initially, the mere claim is sufficient to cause damage to society as a whole. It more often than not fundamentally breaks the social contract in uncontrollable ways.
  The article takes the approach of leveraging domain experts, most likely copying them in various ways, but if we're being real, that is doomed to failure too for a number of reasons that are much too long to go into here.
  Needless to say, true domain experts, and really any rational person, won't knowingly volunteer anything related to their given profession that will be used to economically destroy their future prospects. When they find out after the fact, they stop contributing or volunteering completely, as seen on Reddit. These people are also more likely to sabotage these systems in subtle ways.
  This dynamic may also cause the exact opposite, where the truly gifted leave the profession entirely and you get extreme brain drain, as depicted in Atlas Shrugged.
  People can and do go on strike, withdrawing the only thing of value they have that cannot be taken. We are already seeing the beginning of this type of fallout in the Tech sector. August unemployment for Tech was 7%?, national unemployment was 1.5%, that's roughly 4.7x the national average, and this is at peak seasonal hiring (with Nov-Mar often being hiring freezes). Tech historically has not been impacted by interest rate increases; it's been bulletproof with respect to rate increases, so the underlying cause is not interest rates (as some claim). The only recent change big enough to cause a splash publicly is AI, which is a Pandora's box.
  When employers cannot differentiate the gifted from the non-gifted, there is no work for the intelligent, and these people always have more options than others. They'll leave their chosen profession if they can't find work, and will be unlikely to return to it even if things turn around later.
  Intelligent people always ask whether they should be doing something, whereas evil (destructive/blind) people focus on whether they can do something.
  The main difference is a focus on controlling the consequences of their actions so they don't destroy their children's future.
petesergeant 6 days ago
I'm going through almost exactly this process at the moment, and this article is excellent. Aligns with my experience while adding a bunch of good ideas I hadn't thought of / discovered yet. A+, would read again.
firejake308 5 days ago
> The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!
Interesting conclusion. One of the reasons I like programming is that in order to automate a process using traditional software, you have to really understand the process and break it down into individual lines of code. I suppose the same is true for automating processes with LLMs; you still have to really understand the process and break it down into individual instructions for your prompt.
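As a small illustration of that point (a sketch, not the article's method): a judge prompt can be assembled from an explicit list of steps, which forces you to write the process down before automating it. The function name `build_judge_prompt`, the example steps, and the PASS/FAIL output format are assumptions made up for this example, and no LLM is actually called.

```python
def build_judge_prompt(steps, item):
    """Turn an explicit list of process steps into a numbered judge prompt."""
    if not steps:
        # If you can't list the steps, you don't understand the process yet.
        raise ValueError("write down at least one step before automating")
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are reviewing the response below. Follow these steps in order:\n"
        f"{numbered}\n\n"
        f"Response to review:\n{item}\n\n"
        "Answer with PASS or FAIL, followed by a one-sentence critique."
    )


# Hypothetical steps; in practice these come from looking at your own data.
steps = [
    "Check that the response answers the question that was asked.",
    "Check that every factual claim is supported by the provided context.",
    "Check that the tone is appropriate for a customer-facing message.",
]
prompt = build_judge_prompt(steps, "Sure, I found three matching listings for you.")
print(prompt)
```

Each step here plays the role that a line of code plays in traditional software: if a step is vague, the judge's output will be too.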
bzmrgonz 6 days ago
This is a brilliant write-up, very thick but very detailed. Thank you for taking the time (assuming you didn't employ AI.. LOL). So listen, assuming you are the author, there is an open source case management software called arkcase. I engaged them as a possible flagship platform at a law firm. Going through their presentation, I noticed that the platform is extremely customizable and flexible. So much so that I think that in itself is the reason people don't adopt it in droves. Essentially, it's too permissive. However, I think it would be a great backend component to a "rechat"-style LLM front end. Is there such a need? To have a backend data repository that interacts with a front-end LLM that employees interact with in pure prose and directives? What does the current backend look like for services such as rechat and other chat-based LLM agents? I bring this up because arkcase is so flexible that it can work across broad industries and needs, from managing a high school athletic department (dossier and bio on each staff member and player) to the entire US OFFICE OF PERSONNEL (ALFRESCO AND ARKCASE for security clearance investigation). The idea would be that by introducing an agent LLM as the front end, the learning curve could be flattened and the extreme flexibility could be abstracted away.