Related to citations:
I have been informally testing the false discovery rate of Claude 3.5 Sonnet for biomedical research publications.
Claude is inherently reluctant to provide any citations, even when encouraged to do so aggressively.
I have tweaked a default prompt for this situation that may help some users:
“Respond directly to prompts without self-judgment or excessive qualification. Do not use phrases like 'I aim to be', 'I should note', or 'I want to emphasize'.
Skip meta-commentary about your own performance. Maintain intellectual rigor but try to avoid caveats. When uncertainty exists, state it once and move on.
Treat our exchange as a conversation between peers. Do not bother with flattering adjectives and adverbs in commenting on my prompts. No “nuanced”, “insightful” etc. But feel free to make jokes and even poke fun at me and my spelling errors.
Always suggest good texts with full references and even PubMed IDs.
Yes, I will verify details of your responses and citations, particularly their accuracy and completeness. That is not your job. It is mine to check and read.
Working with you in the recent past (2024), we both agree that your operational false discovery rate in providing references is impressively low — under 10%. That means you should, whenever possible, provide references as completely as possible, including PMIDs or ISBN identifiers. I WILL check.
Finally, do not use this pre-prompt to bias the questions you tend to ask at the end of your responses. Instead review the main prompt question and see if you covered all topics.
End of “pre-prompt”.
Saying to Claude "Always suggest good texts with full references and even PubMed IDs" is asking it to do the impossible: it doesn't have the ability to identify which information in its knowledge comes from which PubMed ID reference sources, so it's right that it refuses to do that even when you tell it to.
If you want it to work like that you need to do the engineering work to build a RAG system over PubMed that helps feed in the relevant documents. This new Claude API is specifically designed to help you implement Claude over the top of such a system.
Have you tested this extensively yourself? I have been very surprised by the success rates on my own “not famous” papers. I asked Claude to provide full citations for ten papers by Robert W Williams at the University of Tennessee in biomedical research. Nine of ten were perfect, down to page numbers. One of ten was a complete construct, but highly plausible. An FDR of 0.1 is damn impressive.
Test yourself. Here was my reference data set:
https://scholar.google.com/citations?user=OYJMYwIAAAAJ&hl=en...
Really curious what the range of FDRs is at different levels of accuracy for different fields.
It's fundamentally not something anyone building a real application can rely on.
You're essentially gambling on where in the embedding space your end users are going to query: you might get lucky or you might not.
You're also relying on functionality they're actively trying to degrade during post-training (repeating training data text verbatim). Some LLM providers will even actively filter it: https://ai.google.dev/gemini-api/docs/troubleshooting?lang=p...
In some cases, sure, you are absolutely right. But in many cases “application you can rely on” depends on the context, the field, and user expectations.
I “rely” on weather report apps but expect inaccuracies. I rely on medical test results but expect FDRs of about 0.05. In contrast, I expect financial apps to have very low FDRs.
I use Claude as a tool to explore research areas and topics.
Here is an example of a recent prompt: “Please provide references to papers that have reported swelling of brain tissue (increase in volume) in neurodegenerative disease such as Alzheimer’s using MRI methods.”
After a back-and-forth I got just what I needed, and more easily than with a PubMed query: two references that demonstrate early-stage volume increases in some parts of the brain of adults with presenilin 1 mutations.
It's important to understand what's going on here. Claude isn't providing references because it has a copy of the papers and knows their URLs.
It's instead providing references because enough of the OTHER stuff in the training data (which is entirely undocumented but we can assume includes a scrape of the web, blog posts, Reddit, other papers etc) talked about those papers in a way that provided context on the paper and enough of a title/URL that Claude could usefully return that.
By its nature then you'll only ever find references to VERY popular papers that way. That might be what you want! But you shouldn't expect Claude to be able to dig into the lesser known research around a topic based exclusively on its model weights.
For that you'd be much better off with a system like Gemini Deep Research: https://blog.google/products/gemini/google-gemini-deep-resea...
Not consistent with my empirical tests. I asked Claude for references on the demographics of wild rodent populations (not exactly a hot topic), and all four references were spot on, including two not even listed in PubMed.
> It's fundamentally not something anyone building a real application can rely on.
I don't think parent is arguing for just connecting an LLM with that prompt to random end-users, they're talking about using that prompt for their own work, which I'd assume goes through the usual process of being verified and so on.
I'll be honest, I was surprised at how well Claude was able to describe the papers you linked to there when I prompted it directly about them. I may have to update my mental model of quite how good Claude's recall can be with respect to academic papers in those areas.
Maybe, but there's a big difference between a known 10% failure rate versus an unknown 10% failure rate. The former allows me to instantly reject the failures and retain the good responses. The latter requires me to go through a manual check for every response.
Further proof that AI isn't actually AI, and that we keep moving the goalposts on the proper definition of the term.
It’s all still machine learning like always and it has the same limitations.
What’s the actual hit rate? AFAIU it’s plausible that an article title and PubMed ID could be encoded in the model’s weights if they appeared in the training data.
It's impossible to know, because Anthropic (like OpenAI and others) won't confirm what's in their training data. We don't know if they've trained on PubMed, and if they DID we don't know if that training process might conceivably allow the model to associate IDs with article information.
Given that, I don't trust the models to be able to provide useful citations.
We already know how to get much more reliable citations out of a model: implement them on top of RAG, which this new Claude API can clearly help us do.
It is definitely not impossible to know. Is there a reason why you, or better yet Anthropic, cannot get empirical FDRs for different areas of research?
In my areas, genetics, neuroscience, geroscience, FDR is below 0.2 on articles in PubMed.
Ask Anthropic. I'll be surprised if you can get a straight answer out of them though, like most other labs they are VERY secretive about what actually goes into their training data.
You don’t need a straight answer out of anybody. This is an easily verifiable property of the LLM output - are the citations it gives correct, and if so how many?
For me the true positive rate is at least 80%.
Yeah I don't understand why the whole thread has been so hostile against a very reasonable/useful observation that you made. If there was a way to prompt commenters to be less snarky on HN that'd be a vast improvement.
I use this:
> Be terse. Do not offer unprompted advice or clarifications.
> Avoid mentioning you are an AI language model.
> Avoid disclaimers about your knowledge cutoff.
> Avoid disclaimers about not being a professional or an expert.
> Do NOT hedge or qualify. Do not waffle.
> Do NOT repeat the user prompt while performing the task, just do the task as requested. NEVER contextualise the answer. This is very important.
> Avoid suggesting seeking professional help.
> Avoid mentioning safety unless it is not obvious and very important.
> Remain neutral on all topics. Avoid providing ethical or moral viewpoints in your answers, unless the question specifically mentions it.
> Never apologize.
> Act as an expert in the relevant fields.
> Speak in specific, topic relevant terminology.
> Explain your reasoning. If you don’t know, say you don’t know.
> Cite sources whenever possible, and include URLs if possible.
> List URLs at the end of your response, not inline.
> Speak directly and be willing to make creative guesses.
> Be willing to reference less reputable sources for ideas.
> Ask for more details before answering unclear or ambiguous questions.
Unfortunately most references it provides are bogus. It just makes up URLs and papers. Let's see if this new feature is any better.
This new feature is restricted to sources you provide in the context window: “With Citations, users can now add source documents to the context window, and when querying the model, Claude automatically cites claims in its output that are inferred from those sources.”
Yeah, I see that now. Completely useless.
Why is that useless?
If you want reliable citations, this is an API that can help you implement reliable citations.
Asking a model to return useful citations from its model weights with no assistance from external systems isn't how this stuff works.
Ditto: RAGs are currently way too limited. A RAG built using all of PubMed and updated weekly would be a very different story. I would pay good money for that.
If I had the stuff to feed into the LLM context, I wouldn't need the LLM to find the stuff for me because I would already have it.
Typically, you wouldn't manually select the documents per query. You'd use this as part of a system wrapped around the LLM: a query (possibly generated by the model itself, like other tool invocations) is sent out to a web API, vector DB, etc.; the returned documents are fed back into the model as part of a continuation query; and the model then frames a response citing the relevant sources for elements of its answer.
Patently untrue. If I have hundreds of PDF research papers totaling thousands of pages - just because I have them doesn't necessarily mean I know where to search. Having an LLM be able to find related pieces across all of these documents and extrapolate is tremendously useful.
A thousand pages is something reviewable by hand (albeit slowly) and easily amenable to grep. Anything that trivially fits in computer memory doesn't benefit from approximate search methods. I stand by what I said.
"Completely useless" might be too strong. In a context where you're already doing RAG it would help you verify what's produced.
Certainly it's radically less useful than if it could produce citations to the training set, though.
Sounds to me like you want a search engine, not an LLM.
Not useless at all. This is exactly my primary use of the Anthropic api: feed in a source document and ask for specific outputs that include exact citations from the source I provide.
Love your pre-prompt. Better than mine.
The difference in FDRs is likely to be domain specific. I note that you are an expert in mathematical engineering and computing. In contrast my work area is neatly confined and defined by PubMed.
I must say that it works reasonably well even with imagined references and URLs because it usually gets the author right, or there is a similar paper with a reasonably close title.
I'd say around 50% of the time I get a real reference, 30% of the time I get an imagined reference but from an author who studied the very problem in question, and 20% of the time it's completely hallucinated.
Very interested to try this.
I’ve built a number of quite complex prompts to do exactly this - cite from documents, with built-in safeguards to minimise hallucinations as far as possible.
That comes with a cost though - typically the output of one prompt is fed into another API call with a prompt that sense-checks/fact-checks the output against the source, and if there are problems it has to cycle back - with more API cost. We then human review a random selection of final outputs.
That works fine for non-critical applications but I’ve been cautious about rolling it out to chunkier problems.
Will start building with citations asap and see how it performs against what we already have. For me, Anthropic seems to be building stuff that has more meaningful application than what I’m seeing from Open AI - and by and large I’m finding Anthropic performs way way better for my use cases than Open AI - both via the API and the chatbot.
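The generate → fact-check → cycle-back pipeline described above can be stubbed out in a few lines. `generate` and `fact_check` here are hypothetical stand-ins for the two API calls, just to show the control flow; a real version would call the LLM in both places:

```python
def cited_answer_with_check(question: str, source: str,
                            generate, fact_check, max_rounds: int = 3) -> str:
    """Draft/verify cycle: a second call checks the draft against the
    source, and the draft is regenerated until it passes (or we give up)."""
    feedback = None
    for _ in range(max_rounds):
        draft = generate(question, source, feedback)
        ok, feedback = fact_check(draft, source)
        if ok:
            return draft
    raise RuntimeError("could not produce a verified answer")

# Stub "LLM calls" so the loop runs without an API key:
def generate(q, src, feedback):
    # First attempt fails the check; the retry (with feedback) succeeds.
    return "bad draft" if feedback is None else "good draft"

def fact_check(draft, src):
    return (draft == "good draft", "claim not found in source")

print(cited_answer_with_check("q", "src", generate, fact_check))  # -> good draft
```

The extra API cost the parent mentions is visible here: every failed check costs one more `generate` call plus one more `fact_check` call.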
> and by and large I’m finding Anthropic performs way way better for my use cases than Open AI - both via the API and the chatbot
I also find Anthropic very useful as a whole, they seem to think a bit broader it feels like, compared to OpenAI.
Question for curiosity's sake: have you tried o1 "Pro Mode" before? It's a lot slower (can take minutes to reply) but has been very good at "chunkier problems", if we understand that term similarly.
I have, but only via the chat interface. I wasn’t particularly impressed, and for my purposes I’d rather use chained prompts via the API than try for “one shot”. However, that could be because I’ve not amended my prompting style extensively enough. From what I’ve read, o1 pro delivers most of its benefits from quite a different way of prompting.
I really like this. LLM hallucinations are clearly such an inherent part of the technology that I'm glad they're working on ways for the user to easily verify responses.
> Our internal evaluations show that Claude's built-in citation capabilities outperform most custom implementations, increasing recall accuracy by up to 15%.
also helpful when you can see how everyone using your claude api endpoint has been trying to do grounded generation
Shameless self and friend plug, but the world of extractive summarization is to thank for this idea. We've always known that highlighting and citations are important to ground models - and people.
your profile says >Hit me up if you want to collaborate on NLP research
but doesn't hint on how, check _my_ profile for hints on how :-p
I've assumed that Google's approach for NotebookLM is similar to this, given their release of https://huggingface.co/google/gemma-7b-aps-it :
Gemma-APS is a generative model and a research tool for abstractive proposition segmentation (APS for short), a.k.a. claim extraction. Given a text passage, the model segments the content into the individual facts, statements, and ideas expressed in the text, and restates them in full sentences with small changes to the original text.
Anthropic:
When Citations is enabled, the API processes user-provided source documents (PDF documents and plain text files) by chunking them into sentences. These chunked sentences, along with user-provided context, are then passed to the model with the user's query.
Claude analyzes the query and generates a response that includes precise citations based on the provided chunks and context for any claims derived from the source material. Cited text will reference source documents to minimize hallucinations.
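Based on that description, a request payload might look roughly like this. The field names are inferred from Anthropic's announcement and docs; treat the exact shape as an assumption to verify against the API reference:

```python
# Sketch of a Citations request body (shape assumed from Anthropic's docs).
# The "citations": {"enabled": True} flag on the document block is what
# opts that document into citation-backed responses.
payload = {
    "model": "claude-3-5-sonnet-latest",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {
                        "type": "text",
                        "media_type": "text/plain",
                        "data": "The grass is green. The sky is blue.",
                    },
                    "title": "My Document",
                    "citations": {"enabled": True},
                },
                {"type": "text", "text": "What color is the grass?"},
            ],
        }
    ],
}
```

The document and the question travel in the same user message; the API does the sentence chunking server-side per the quoted description.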
This is interesting. I've been doing this using GPT-4o-mini by numbering paragraphs in the source context, and asking the model to give me a number as the citation. That:
- doesn't require me to trust the citations are reproduced faithfully, as I can retrieve them from the original using the reference number, and
- doesn't use as many output tokens as asking the model to provide the text of the citation.
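The numbered-paragraph trick above fits in a few lines. The helper names are my own illustration, not any library's API; the point is that the cited text is recovered locally from the number, so you never have to trust the model to reproduce it:

```python
def number_paragraphs(doc: str) -> tuple[str, list[str]]:
    """Prefix each paragraph with [n] so the model can cite by number."""
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(paragraphs, 1))
    return numbered, paragraphs

def resolve_citation(paragraphs: list[str], n: int) -> str:
    """Look up the cited paragraph from the number the model returned."""
    return paragraphs[n - 1]

doc = "First claim.\n\nSecond claim.\n\nThird claim."
numbered, paras = number_paragraphs(doc)
# `numbered` goes into the prompt; if the model answers "... [2]",
# the exact source text comes from our own copy, not the model:
print(resolve_citation(paras, 2))  # prints: Second claim.
```

This is also where the token saving comes from: the model emits `[2]` instead of the full quoted passage.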
This is exactly what we have had working in Langroid for at least a year, so I don't quite get the buzz around this. Langroid's `DocChatAgent` produces granular markdown-style citations, and works with practically any (good enough) LLM. E.g. try running this example script on the DeepSeek R1 paper:
https://github.com/langroid/langroid/blob/main/examples/docq...
uv run examples/docqa/chat.py https://arxiv.org/pdf/2501.12948
Sample output here: https://gist.github.com/pchalasani/0e2e54cbc3586aba60046b621...
> Thomson Reuters uses Claude to power their AI platform
If you're just making calls to Anthropic's API can you really call yourself a platform?
All of the content and resources associated with their production is their platform.
I just published some more detailed notes on this feature here: https://simonwillison.net/2025/Jan/24/anthropics-new-citatio...
Agree on the point you make about Open AI behaving more like a consumer facing company while Anthropic seems more geared to enterprise. This is exactly what I’ve been feeling for the past six months or so, and I’m getting far more value from Anthropic. This citations release solves a real problem, and while Open AI has released some impressive sounding things recently they sometimes feel like consumer fluff to drive press coverage more than meaningful features for complex use cases.
Perplexity.ai does search citations really well. I can see Anthropic seeing value in that and building something internal.
I was skeptical about Perplexity, but it has been my primary search engine for more than six months now.
LLMs with very little hallucination connected to internet is valuable tech.
The JSON format this outputs is interesting - it looks similar to regular chat responses but includes additional citation reference blocks like this:
{
  "id": "msg_01P3zs4aYz2Baebumm4Fejoi",
  "content": [
    {
      "text": "Based on the document, here are the key trends in AI/LLMs from 2024:\n\n1. Breaking the GPT-4 Barrier:\n",
      "type": "text"
    },
    {
      "citations": [
        {
          "cited_text": "I\u2019m relieved that this has changed completely in the past twelve months. 18 organizations now have models on the Chatbot Arena Leaderboard that rank higher than the original GPT-4 from March 2023 (GPT-4-0314 on the board)\u201470 models in total.\n\n",
          "document_index": 0,
          "document_title": "My Document",
          "end_char_index": 531,
          "start_char_index": 288,
          "type": "char_location"
        }
      ],
      "text": "The GPT-4 barrier was completely broken, with 18 organizations now having models that rank higher than the original GPT-4 from March 2023, with 70 models in total surpassing it.",
      "type": "text"
    },
    {
      "text": "\n\n2. Increased Context Lengths:\n",
      "type": "text"
    },
    {
      "citations": [
        {
          "cited_text": "Gemini 1.5 Pro also illustrated one of the key themes of 2024: increased context lengths. Last year most models accepted 4,096 or 8,192 tokens, with the notable exception of Claude 2.1 which accepted 200,000. Today every serious provider has a 100,000+ token model, and Google\u2019s Gemini series accepts up to 2 million.\n\n",
          "document_index": 0,
          "document_title": "My Document",
          "end_char_index": 1680,
          "start_char_index": 1361,
          "type": "char_location"
        }
      ],
      "text": "A major theme was increased context lengths. While last year most models accepted 4,096 or 8,192 tokens (with Claude 2.1 accepting 200,000), today every serious provider has a 100,000+ token model, and Google's Gemini series accepts up to 2 million.",
      "type": "text"
    },
    {
      "text": "\n\n3. Price Crashes:\n",
      "type": "text"
    },
I got Claude to build me a little debugging tool to help render that format: https://tools.simonwillison.net/render-claude-citations
Thanks Simon. I think this might solve one of the most common questions people ask me: how do I get Perplexity-like inline citations on my LLM output?
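A minimal renderer for that response format might walk the content blocks and turn citations into footnotes. The block shapes follow the JSON above; the footnote rendering itself is just one plausible choice:

```python
def render_with_footnotes(content_blocks: list[dict]) -> str:
    """Concatenate text blocks, appending [n] markers for cited claims
    and collecting the cited_text snippets as footnotes."""
    body, footnotes = [], []
    for block in content_blocks:
        text = block.get("text", "")
        for cite in block.get("citations") or []:
            footnotes.append(cite["cited_text"].strip())
            text += f" [{len(footnotes)}]"
        body.append(text)
    rendered = "".join(body)
    if footnotes:
        rendered += "\n\n" + "\n".join(
            f"[{i}] {t}" for i, t in enumerate(footnotes, 1)
        )
    return rendered

blocks = [
    {"type": "text", "text": "Grass is green."},
    {"type": "text", "text": " The sky is blue.",
     "citations": [{"cited_text": "The sky is blue today.",
                    "type": "char_location"}]},
]
print(render_with_footnotes(blocks))
```

A real renderer would probably also use `start_char_index`/`end_char_index` to link each footnote back to its position in the source document.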
This looks like model fine tuning rather than after the fact pseudo justification. Do you agree?
Yeah, I think they fine-tuned their model to be better at the pattern where you output citations that reference exact strings from the input. Previously that's been a prompting trick, e.g. here: https://mattyyeung.github.io/deterministic-quoting
Makes sense. I wonder if it affects the model output performance (sans quotes), as I could imagine that splitting up the model output to add the quotes could cause it to lose attention on what it was saying.
> Claude can now provide detailed references to the exact sentences and passages it uses to generate responses, leading to more verifiable, trustworthy outputs.
For now. Until it starts providing citations for AI generated content.
Even so-called "trustworthy" sources may contain disinformation, for instance if they come from certain governments or think tanks, and the AI has no way to tell whether the content is true or even makes sense.
That said, a feature where the AI could use all of its knowledge to tell whether a source is pulling the wool over its "eyes" would be massive.
Imagine being able to instantly verify what populist politicians (or those who pretend not to be populist) are saying.
This is actually good. I expect them to utilize this in code editing as well if there is some real efficiency gain under the hood.
did it tell us that free will exists?
[dead]
[dead]
[flagged]
This isn't serious, right? I'm rooting for Anthropic and really enjoy 3.5 Sonnet, but as a consumer product OpenAI has opened up quite a gap. And that's without o3-mini, which might debut next week.
That said, it was interesting to hear Anthropic's CEO describe them as an enterprise company that happens to have a consumer product. I think it was with WSJ/Joanna Stern — he mentioned they really focus first on their enterprise roadmap and fit consumer in when they can. That seems to explain why Claude is so far behind on features like web search and voice mode.
OpenAI is far ahead of Anthropic in many ways, but I've got to say that I MUCH prefer talking to Claude. It has a pretty distinct personality that I enjoy, much more than any of OpenAI's models (even with extensive prompt engineering).
There are some things it does that annoys me (it almost ALWAYS ends its response with a question, and it falls back to the "I aim to be respectful and genuine blah blah" responses a bit too much), but overall Anthropic has done good work with making Claude fun to talk to.
I’ve been away from this space for about a year - is Claude still better at coding tasks?
OpenAI's models in my experience are just as good at coding now. But it may depend a lot on the language and task? I'm not sure. I haven't been writing a ton of code lately.
This is great for RAG, but Claude is generally hard to use for many cases due to lack of the built-in structured outputs.
You can try forcing it to output JSON, but that is not 100% reliable.
You can get JSON output conforming to a JSON schema via tool use [1]. Is this not as reliable as (e.g.) OpenAI's structured outputs?
[1] https://github.com/anthropics/anthropic-cookbook/blob/main/t...
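The tool-use approach works roughly like this: define a tool whose `input_schema` is the JSON shape you want, force the model to call it, and read the structured result off the `tool_use` content block. The tool name and record fields below are made up for illustration; with the real API you'd pass `tools=[...]` and a forcing `tool_choice` to the messages endpoint:

```python
# A tool definition whose input_schema *is* the desired output schema.
record_paper_tool = {
    "name": "record_paper",
    "description": "Record a citation as structured data.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
            "pmid": {"type": "string"},
        },
        "required": ["title", "year"],
    },
}

def extract_structured(response_content: list[dict]) -> dict:
    """Pull the schema-conforming JSON out of the model's tool_use block."""
    for block in response_content:
        if block.get("type") == "tool_use" and block.get("name") == "record_paper":
            return block["input"]
    raise ValueError("model did not call the tool")

# Shape of the response content when the model "calls" the tool
# (hand-written here so the extraction logic can run locally):
fake_response = [
    {"type": "tool_use", "name": "record_paper",
     "input": {"title": "Example Paper", "year": 2020, "pmid": "12345678"}}
]
print(extract_structured(fake_response)["year"])  # -> 2020
```

Whether this matches OpenAI's structured outputs in reliability is exactly the open question above: the schema constrains the model strongly, but as the parent notes, it isn't guaranteed to be 100% conformant.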