The point of this study is that it suggests fully AI-automated mammography with this model can currently deliver about 70% sensitivity in detecting breast cancer. It does not enable us to compare AI to unaided human performance. As this study did not include healthy controls, there is no false positive rate. The false positive rate is a crucial missing metric, since the vast majority of women do not have breast cancer.
In nearly half the false negatives from both the mammogram and DWI datasets, the cancer was categorized as occult by two breast radiologists, meaning the cancer was invisible to a trained eye. The AI model's non-occult false negative rate on the mammography data is 19.3%.
For that 19.3% figure, see Table 2: 68 non-occult in AI-missed cancer, 285 non-occult in AI-detected cancer.
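For anyone checking that arithmetic, a minimal sketch using only the Table 2 counts quoted above:

    # Non-occult false negative rate = non-occult AI misses / all non-occult cancers
    missed_non_occult = 68       # non-occult cancers in the AI-missed group (Table 2)
    detected_non_occult = 285    # non-occult cancers in the AI-detected group (Table 2)
    rate = missed_non_occult / (missed_non_occult + detected_non_occult)
    print(f"{rate:.1%}")         # -> 19.3%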
This study did not compare the AI to a radiologist on a mixed set of healthy and cancer images.
As it was a retrospective study, I really hope they made sure that the test images were not in the algorithm's training set. If they were, the whole study is meaningless.
What sensitivity / specificity are trained radiologists able to achieve?
Great question. It prompted me to search for and find a history of sensitivity in mammography[1].
Their conclusion is that a sensitivity of around 39% is supported by the evidence. Furthermore, there is a persistent erroneous belief that mammographic sensitivity is 90-95%.
I'm not a medical researcher, but I am a computer guy, and I was struck by something very different in the papers: the abstracts, at least, refer to "AI CAD" as what they're testing, with no software information and no versioning. On the CS side, this stuff is of paramount importance for knowing how the software performs.
On the medical side, we need statistically significant tests that physicians can know and rely on - this paper was likely obsolete when it was published, depending on what "AI CAD" means in practice.
I think this impedance mismatch between disciplines is pretty interesting; any thoughts from someone who understands the med side better?
The link is essentially a press release. The information you want is (sorta) in the actual paper it describes *.
"The images were analyzed using a commercially available AI-CAD system (Lunit INSIGHT MMG, version 1.1.7.0; Lunit Inc.), developed with deep convolutional neural networks and validated in multinational studies [1, 4]."
It's presumably a proprietary model, so you're not going to get a lot more information about it, but it's also one that's currently deployed in clinics, so...it's arguably a better comparison than a SOTA model some lab dumped on GitHub. I'd add that the post headline is also missing the point of the article: many of the missed cases can be detected with a different form of imaging. It's not really meant to be a model shoot-out style paper.
* Kim, J. Y., Kim, J. J., Lee, H. J., Hwangbo, L., Song, Y. S., Lee, J. W., Lee, N. K., Hong, S. B., & Kim, S. (2025). Added value of diffusion-weighted imaging in detecting breast cancer missed by artificial intelligence-based mammography. La Radiologia medica, 10.1007/s11547-025-02161-1. Advance online publication. https://doi.org/10.1007/s11547-025-02161-1
It's interesting to see this valid argument raised against this use of AI to identify breast cancer. The lack of control groups is one of the more common concerns raised about vaccines as well, where the argument lands like a lead balloon.
Because it's generally unethical to not give someone a treatment known already to be safe and effective. Studies of new vaccines where there is not an existing vaccine _do_ use placebo controls. Heck, my son got placebo during moderna's pediatric covid vaccine trial (to our frustration. grin.)
Subsequent trials generally compare against the best known current treatment as the control instead.
This study has no such concerns. It's ethical to include images of non-cancerous breast tissue. The things are not comparable.
The covid vaccines were a whole different beast; though interesting case studies, they were done under emergency authorization and didn't follow standard protocols.
Vaccine studies today almost always use a previously approved vaccine as the "control" group. That isn't a true control and if you walk back the chain of approvals you'd be hard pressed to find a starting point that did use proper control groups.
Anyway, my point here wasn't to directly debate vaccines themselves, only to point out that it's interesting to me, as someone without a career in health, to see effectively the same argument used in two different scenarios with drastically different common responses.
Right, but the people making the argument about vaccines don't understand the principles, because the principles actually are the same in both cases!
1) a double blind RCT with a placebo control is a very good way to understand the effectiveness of a treatment.
2) it's not always ethical to do that, because if you have an effective treatment, you must use it.
Even without a placebo control you can still estimate both FN and FPs through careful study design, it's just harder and has more potential sources of error. A retrospective study is the usual approach. Here, the problem is they only included true positives in the retrospective study, so they missed the opportunity to measure false positives.
And the problem with -that- is that it's very easy to have zero false negatives if you always say "it's positive". Almost every diagnostic instrument has something we call a receiver operating characteristic (ROC) curve that trades off false positives against false negatives as you change the threshold at which you decide something is a positive. By omitting the false positives, they present a very incomplete picture of the diagnostic capabilities.
(In medicine you will often see the terms "sensitivity" and "specificity" for how many TPs you detect and how many TNs you correctly call negative. It's all part of the same type of characterization.)
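To make the ROC point concrete, here's a toy sketch with made-up scores (not study data): sweeping the decision threshold trades false negatives for false positives, which is why quoting sensitivity alone says very little.

    # Toy illustration of the ROC trade-off with made-up scores, not study data.
    # Each case is (model_score, has_cancer); lowering the threshold catches more
    # cancers (fewer FN) at the price of more false alarms (more FP).
    cases = [(0.95, True), (0.80, True), (0.40, True), (0.20, True),
             (0.90, False), (0.60, False), (0.30, False), (0.10, False)]

    for threshold in (0.9, 0.5, 0.1):
        tp = sum(s >= threshold and c for s, c in cases)
        fn = sum(s < threshold and c for s, c in cases)
        fp = sum(s >= threshold and not c for s, c in cases)
        tn = sum(s < threshold and not c for s, c in cases)
        print(f"threshold={threshold}: sensitivity={tp / (tp + fn):.2f}, "
              f"specificity={tn / (tn + fp):.2f}")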
The two points you raise with regard to why vaccine or similar studies may be treated specially don't make up for the data lost when a double-blind study with a control isn't run, nor do they make estimates based on modelling indicate anything more than correlation.
We may broadly agree that submitting a control group to a placebo treatment for a particular disease is immoral, but that doesn't mean such a study isn't necessary to prove out the efficacy or safety of the treatment. As for modelling, for example trying to estimate FN and FP, it can only ever indicate correlation at best and will never indicate likely causation.
But it's not. You can do an RCT of the new treatment vs the old treatment. You won't get a direct measure of its absolute efficacy but you will know if it's superior/non-inferior to the best known thing. And then you can use observational techniques to estimate the absolute values. That's exactly what you would do if you wanted to develop, say, a new flu vaccine that you thought would outperform current vaccines. You get the most important information: whether or not we should switch to the new one.
If you have a new vaccine for a disease for which there is no existing vaccine you do a standard placebo controlled RCT which gives you a direct, high quality measurement of efficacy and side effects.
Vaccine studies use a different experimental design known as “longitudinal,” meaning they follow people over time. This study did not do that. It’s still a valid design, just limited in what it tells us.
> It's interesting to see this valid argument raised against this use of AI to identify breast cancer. The lack of control groups is one of the more common concerns raised about vaccines as well, where the argument lands like a lead balloon.
Not just vaccines: the same question is posed in every study of a drug's effectiveness, especially when dealing with potentially life-threatening conditions. From [0]:
. . . ethical guidance permit the use of placebo controls in randomized trials when scientifically indicated in four cases: (1) when there is no proven effective treatment for the condition under study; (2) when withholding treatment poses negligible risks to participants; (3) when there are compelling methodological reasons for using placebo, and withholding treatment does not pose a risk of serious harm to participants; and, more controversially, (4) when there are compelling methodological reasons for using placebo, and the research is intended to develop interventions that can be implemented in the population from which trial participants are drawn, and the trial does not require participants to forgo treatment they would otherwise receive.
The original study: https://link.springer.com/article/10.1007/s11547-025-02161-1
It was retrospective-only, i.e. a case series on women who were known to have breast cancer, so there were zero false negatives and zero true negatives, because all patients in the study truly had cancer.
The AI system was a ConvNet in commercial use circa 2021, which is when the data for this case series were collected.
> It was retrospective-only, i.e. a case series on women who were known to have breast cancer, so there were zero false negatives and zero true negatives, because all patients in the study truly had cancer.
Well yes, that's the denominator for determining sensitivity, which is what the headline claim is about.
Also, they need to set up their next paper:
> However, the retrospective, cancer-only design limits generalizability, highlighting the need for prospective multicenter screening trials for validation.
> The AI system was a ConvNet in commercial use circa 2021, which is when the data for this case series were collected.
Does this mean that newer AI systems would perform significantly differently?
Strictly in terms of architecture, CNNs are still SOTA for small-data visual tasks, especially when the target is a locally specific phenomenon where global context isn't as necessary. They have a good inductive bias for this.
The main known way to improve performance on tasks like this is getting more data.
Well, certainly not. We shouldn't draw conclusions about modern AI systems from multi-generation-old systems, one way or the other.
Not at all. There is no implication, implicit or explicit, that anything in the world is better or worse. It is just a statement of fact.
>better or worse.
Please quote where I used either word.
> there were zero false negatives
Wouldn't this mean that the AI identified them all as having cancer?
They did all have cancer
Yes, but then the study result should be, "AI correctly identifies 100% of breast cancers in study"
If we're saying there was a discrepancy and we're saying that all of the patients had cancer, then it would seem that there must have been some that were identified as not having cancer by AI.
2021 is an eternity ago in the AI industry.
Edit: I have a problem with the way the title uses "AI" as a singular unchanging entity. It should really be "An AI system misses nearly...". There is no single AI and models are constantly improving - sometimes exponentially better.
I believe there's a big issue in the US of over-diagnosing breast cancer too. "Known to have breast cancer" might not be so clear cut.
That statement would benefit from a link to your source.
The title bothers me. It suggests to me that "AI" is a single thing. If two guys are tested and turn out to be not that great at reading MRI images, should the headline be "Male radiologists miss nearly one-third of breast cancers"?
If it said "AI something", I'd be fine with it. It's a statement about that something, not about AI in general. Use it as an adjective (short for "AI-using" I guess?), not a noun.
They would just write "Radiologists miss nearly one-third of breast cancers."
I take the point of this article to be simply that hospitals need to rethink any decision to replace all their doctors today.
No hospital is deciding that. People have been testing whether we can replace radiologists with AI for over 10 years.
If they tested someone with no background in radiology, they could even run the headline "Humans miss 50% of breast cancers".
I wouldn't miss any of them: "Idk what I'm looking at, but I'll click positive and let someone who does sort it out; this is way too dangerous."
AI doesn't have that option yet.
There are more radiologists than AI models that read MRIs.
> It suggests to me that "AI" is a single thing.
But it is. It's LLMs. There is no other "AI".
Haven't you read HN in the past 1-2 years?
This seems a bit like a needlessly publicized finding. Surely our baseline assumption is that there are lots of systems that aren't very good at finding cancer. We're interested in the ones that are good. You only need one good system to adopt. Yes, it's good scientific hygiene to do the study and publish it, saying "Well, this particular thing isn't good, let's move on". But my expectation is you just keep going until you design a system that does do well and then adopt that system.
If I pluck a guy off the street, get him to analyze a load of MRI scans, and he doesn't correctly identify cancer from them, I'm not going to publish an article saying "Humans miss X% of breast cancers", am I?
I think finding that AI, or at least a specific model sold as being able to do something, can't reasonably do it is an entirely reasonable thing to publish.
In the end it is on the model's marketer to prove that what they sell can do what it says, and counterexamples are a fully valid thing to then release.
I've been adjacent to this field for a while, so take this for what it is. My understanding is that developing a system that can accurately identify a specific form or sub-form of cancer to a degree equal to or better than a human is doable now. However, developing a system that can generalize to many forms of cancer is not.
Why does this matter? Because procurement in the medical world is a pain in the ass. And no medical center wants to be dealing with 32 different startups each selling their own specific cancer detection tool.
Many people are confused and think the Bitter Lesson is that you can just feed up a bigger and bigger model and eventually it becomes omnipotent.
They promised us that AI was The Solution. Now they have to deliver.
If the TechBros fail us here, we may then assume they may fail us everywhere else as well.
And we'd be wrong about that. Different domains are showing wildly different characteristics, with some ML models showing superhuman or expert level performance in some domains (chess, face and handwriting recognition for example) and promising but as yet just not good enough in other domains (radiography, self-driving cars, question answering, prose writing). Currently coding is somewhere in the middle; superhuman in some ways, disappointingly unusable in others.
I don't think we can reach any conclusive verdict about the promise of ML for radiography right now; given the life-critical nature of the application, it's in the unusable middle, but it might get better in a few years or it might not. Time will tell.
The study, as described in the summaries, sounds very flawed.
1. They only tested 2 Radiologists. And they compared it to one model. Thus the results don’t say anything about how Radiologists in general perform against AI in general. The most generous thing the study can say is that 2 Radiologists outperformed a particular model.
2. The Radiologists were only given one type of image, and then only for those patients that were missed by the AI. The summaries don't say if the test was blind. The study has 3 authors, all of whom appear to be Radiologists, and it mentions 2 Radiologists looked at the AI-missed scans. This raises questions about whether the test was blind or not.
Giving humans data they know are true positives and saying “find the evidence the AI missed” is very different from giving an AI model also trained to reduce false positives a classification task.
Humans are very capable at finding patterns (even if they don’t exist) when they want to find a pattern.
Even if the study was blind initially, trained human doctors would likely quickly notice that the data they are analyzing is skewed.
Even if they didn’t notice, humans are highly susceptible to anchoring bias.
Anchoring bias is a cognitive bias where individuals rely too heavily on the first piece of information they receive (the "anchor") when making subsequent judgments or decisions.
The skewed nature of the data has a high potential to amplify any anchoring bias.
If the experiment had controls, any measurement error resulting from human estimation errors could potentially cancel out (a large random sample of either images or doctors should be expected to have the same estimation errors in each group). But there were no controls at all in the experiment, and the sample size was very small. So the influence of estimation biases on the result could be huge.
From what I can read in the summary, these results don’t seem reliable.
Am I missing something?
They did NOT test radiologists. There were NO healthy controls. They evaluated the AI's false negative rate and used exclusively unblinded radiologists to grade the level of visibility and other features of the cancers.
The utility of the study is to evaluate potential AI sensitivity if used for mass, fully automated screening of mammography data. But it says NOTHING about the CRUCIAL false positive rate (no healthy controls) and NOTHING about AI vs. human performance.
See my main comment elsewhere in this thread.
Huh? I was commenting that there were no controls and the doctors were given skewed data, so any conclusions about AI ability vs. doctor ability seem misplaced. Which seems to be what you just said… so I am confused about what I said that was inaccurate.
Can you clarify?
I also hinted at the fact that I only had access to the posted summary and the original linked article, and not the study. So if there is data I am missing… please enlighten me.
I was just reinforcing that point, as your comment was worded in a way that left room for doubt. Sorry if this came across as critical toward you or as implying you held a different interpretation.
This article is about measuring how often an AI missed cancer by giving it data only where we know there was cancer.
> Am I missing something?
Yes. The article is not about AI performance vs human performance.
> Humans are very capable at finding patterns (even if they don’t exist) when they want to find a pattern
Ironic
The article has the headline "AI Misses Nearly One-Third of Breast Cancers, Study Finds".
It also has the following quotes:
1. "The results were striking: 127 cancers, 30.7% of all cases, were missed by the AI system"
2. "However, the researchers also tested a potential solution. Two radiologists reviewed only the diffusion-weighted imaging"
3. "Their findings offered reassurance: DWI alone identified the majority of cancers the AI had overlooked, detecting 83.5% of missed lesions for one radiologist and 79.5% for the other. The readers showed substantial agreement in their interpretations, suggesting the method is both reliable and reproducible."
So, if you are saying that the article is "not about AI performance vs human performance", that's not correct.
The article very clearly makes claims about the performance of AI vs the performance of doctors.
The study doesn't have the ability to state anything about the performance of doctors vs the performance of AI, because of the issues I mentioned. That was my point.
But the study can't state anything about the sensitivity of AI either, because it doesn't compare the sensitivity of AI-based mammography (X-ray) analysis with that of human-reviewed mammography. Instead it compares AI-based mammography vs. human-read DWI, where the humans knew the results were all true positives. It's both a different task ("diagnose" vs "find a pattern to verify an existing diagnosis") and different data (X-ray vs MRI).
So, I don't think the claims from the article are valid in any way. And the study seems very flawed.
Also, attempting to measure sensitivity without also measuring specificity seems doubly flawed, because there are very big tradeoffs between the two.
Increasing sensitivity while also decreasing specificity can lead to unnecessary amputations. That's a very high cost. Also, apparently studies have shown that high false positive rates for breast cancer can lead to increased cancer risks because they deter future screening.
Given that I don't have access to the actual study, I have to assume I am missing something. But I don't think it's what you think I'm missing.
Shouldn't A.I. be used only in a way where it merely assists? E.g. a doctor takes a look first, and if (s)he can't find anything, then A.I. checks as well (or in parallel).
My personal opinion: AI should be still kept out of anything mission critical, in all stages, except for evaluation.
There is another comment very correctly noting that this result is on 100% positive input. The same AI in "real life" would probably score much better eventually. But as you point out, if used as a confirmation tool, it's definitely bad.
> The same AI in "real life" would probably score much better eventually
Either I don't understand your reasoning or you are very much wrong. A "real life" dataset would contain real negatives too, and the result would be equal if the false positive rate were zero and strictly worse if the rate were any higher. One should expect the same AI to score significantly worse in a real-life setting.
It depends on what you call better or worse. In real life, positives are far less common than negatives; if this system does not produce lots of FPs (which is very possible), the accuracy will be much better than you may expect.
What I mean by "score" is having a relatively high accuracy.
Come on, let's do the math: incidence of BC is 1 in every 12, let's say. Now take 12,000 patients: 1,000 with cancer and 11,000 without. Say the AI misses about a third of the cancers (TP = 667, FN = 333) and flags 300 healthy patients (FP = 300, TN = 10,700):
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (667 + 10,700) / (667 + 10,700 + 300 + 333) = 11,367 / 12,000 ≈ 0.947, so the test is 94.7% accurate… pretty impressive, huh?
Tell me if I'm wrong. It's a known fact that you have to be careful when doctors speak of % accuracy.
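A quick sanity check of that arithmetic in Python (same assumed numbers: 1-in-12 incidence, roughly a third of cancers missed, 300 false positives):

    # Accuracy looks great even when a third of cancers are missed, because
    # negatives dominate the screening population. Assumed numbers, not study data.
    patients = 12_000
    cancers = 1_000               # 1-in-12 incidence, rounded
    tp, fn = 667, 333             # AI detects ~2/3 of the cancers, misses ~1/3
    fp = 300                      # assumed false positives
    tn = patients - cancers - fp  # 10,700 healthy patients correctly cleared
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(f"accuracy={accuracy:.1%}, sensitivity={tp / (tp + fn):.1%}")  # 94.7%, 66.7%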
There was a study that found that, in radiology, human-first assessment resulted in worse outcomes than human-alone. Possibly the humans were letting borderline cases through, on the assumption that the machine would catch them.
There's a roundup of such findings here, but they're a mixed bag: https://www.uxtigers.com/post/humans-negative-value
I suspect you need careful process design to get better outcomes, and it's not one-size-fits-all.
In case you missed it, they weren't using AI to make these diagnoses.
In the human follow-up, there was an improvement but there was still a gap:
> Their findings offered reassurance: DWI alone identified the majority of cancers the AI had overlooked, detecting 83.5% of missed lesions for one radiologist and 79.5% for the other.
The combination of AI and this DWI methodology seems to identify most of the cancers, but there's still about 20% of that 1/3 that gets missed. I assume that, as these were confirmed diagnoses, they were caught with another method beyond DWI.
Please always present the confusion matrix. One number is (almost) useless.
I can detect 100% by
def detect(x):
    Return True
They only had positive samples, so they could and did only report true positives and false negatives.
While they only had positive samples, the AI sometimes reported cancer in the wrong location, meaning a double whammy: it failed to detect the cancer and misdiagnosed a non-cancerous region.
From the paper:
> Two cancers had abnormality scores greater than 10 but were not correctly localized and were therefore categorized as AI-missed.
There were no healthy controls, so they can only measure sensitivity, not specificity in this design.
100% of the titles are not specific enough on sensitive matters.
This actually fails (assuming it's Python): "return" needs a lowercase "r". But you could rewrite it in Haskell or Rust for safety.
It would be useful to show the failure rate for humans, and for humans assisted by these systems.
No, this design is not capable of showing that. It did not include a comparison with blinded human readers, did not provide the AI with the same data used to make the initial diagnosis, and did not include healthy controls. It measures sensitivity only.
The missed cases should be attributed to the specific model deployed in the product, not to AI as a general concept. Framing this limitation under a broad and alarming title is therefore misleading and unnecessary.
Okay guys, I developed an AI mammo screening product, so let me clear things up. You read it wrong, and I don't blame you; I doubt whoever wrote this actually has a good understanding of the numbers.
The setup:
1. 400-odd confirmed cancer patients.
2. The AI reads the mammography ONLY and missed 1/3.
3. On those AI-missed patients, radiologists do a second read on MRI, which is the gold standard for differential diagnosis.
Evidence: the referenced paper at the bottom, <Added value of diffusion-weighted imaging in detecting breast cancer missed by artificial intelligence-based mammography.>
So, the whole point it (or its referenced paper) is trying to make is: mammography sucks, MRI is much better, which is a KNOWN FACT.
Now, let me give you some more background missing from the paper:
1. Why does mammography suck? Well, go google/GPT some images: it's essentially an X-ray of the breast, which compresses a 3D volume into a 2D averaged/pooled plane, which is information-lossy. So, AI or not, sensitivity is limited by the modality.
2. How bad/good is mammography AI? I would say 80~85% sensitivity against a very thorough + experienced radiologist without making an unbearable amount of FPs, which probably translates to roughly 2/3 sensitivity against a real cancer cohort (rough arithmetic in the sketch below), so the referenced number is about right.
3. Mammography sucks, so what's the point? It's cheap AND fast; you can probably do a walk-in and get the interpretation back in hours, whereas for MRI you probably need to schedule 2 weeks ahead if not MORE. For yearly screening, it works for the majority of the population.
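A back-of-the-envelope for point 2; the 0.80 radiologist baseline is an assumption of mine, not a number from the paper or the comment above:

    # Back-of-the-envelope only; the 0.80 absolute radiologist sensitivity is an assumption.
    radiologist_sensitivity = 0.80   # assumed absolute sensitivity of a very thorough reader
    ai_vs_radiologist = 0.83         # AI relative to that reader, per the 80~85% range above
    print(f"{radiologist_sensitivity * ai_vs_radiologist:.2f}")  # ~0.66, i.e. about 2/3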
And final pro tips:
1. Breast tumors are more prevalent than you think (maybe 1 in 2 by age 70+).
2. Most guidelines recommend women 45+ get a yearly checkup.
3. If you have dense breasts (basically small and firm), add ultrasound screening to make sure.
4. Breastfeeding does good for both the mother and child, so do that.
peace & love
Radiologist here, thanks for posting this because I was biting my tongue.
I'll supplement by directing others to consider how number needed to screen may be a more useful metric than mammographic sensitivity when making policy decisions. They're related, obviously, but only one of them concerns outcomes.
"AI" doesn't exist. There are probably hundreds of different breast cancer detection algorithms. Maybe the SOTA isn't good enough yet. That doesn't mean AI in general is fundamentally incapable of correctly detecting it.
It is a useful umbrella term for all the systems in question.
It seems this study tested a single one of them.
These reports often misrepresent scientific data; I think that's a given.
What's the baseline? How many did a human being get? We need to compare this to a baseline to know if it's good or bad.
As there were no healthy controls or comparison with blinded radiologist performance, it is not possible to answer those questions using this study. The point was to evaluate AI sensitivity, which is useful. However, without healthy controls, it is not possible to determine specificity, which is a crucial statistic since most women do not have breast cancer.
As someone else has pointed out, I would like to know how this compares to humans.
I hope something good comes out of this, as I have known women whose lives were deeply affected by this.
So basically AI kills people.
This is Skynet 2.0 or 3.0. But shit. James Cameron may have to redo The Terminator, to include AI. Then again, who would watch such a movie?
Obviously we need more breasts "for training."
Not all AIs are created equal.
There's a lot of goalpost-moving here. Measuring sensitivity only is still useful, and it at least aids radiologists in deciding whether to use this specific model and how much to rely on it. Also, why does every study have to compare all humans across human history to some particular model?
This is a terrible article.
"One AI is not great" is not an interesting finding and certainly not conclusive of "AI can't help or do the job".
It's like saying "some dude can't detect breast cancer" and suggesting all humans are useless.
AI finds nearly 2/3rds of breast cancers!
I see you're not in the marketing department! We can do better by only considering "missed" as what wouldn't also be missed by a human: AI finds 71% of breast cancer*!
*Compared to a human.
Depending on how the costs of AI detection vs. a doctor compare, that genuinely might be enough to shift the math and be a net positive. If it is cheap enough to test 10x the currently tested population, which would have lower, but non-zero, rates of breast cancer, then[0] AI would result in more cancers detected and therefore more aggregate lives saved.
[0]presumptively
Given that every positive case needs to be verified by a doctor anyway because the patient has breast cancer, and every negative case has to be checked because it does a worse job than traditional methods... It only costs more.
Depends on the false positive rate. Hypothetically one can 'just' tune the model so false positives are low. This will increase false negatives but those are 'free' as they don't require follow ups. So long as the cost per real positive[0] goes down, there's a benefit to be had.
[0] accounting for false positives, screening costs for true negatives, etc. etc.
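For a feel of that accounting, here's a toy cost model; every number in it is an illustrative assumption, not data from the study:

    # Toy cost-per-detected-cancer model; all figures below are made-up assumptions.
    population = 100_000
    prevalence = 0.005                        # assumed screening-round prevalence
    sensitivity, specificity = 0.70, 0.95     # assumed operating point after threshold tuning
    ai_read_cost, followup_cost = 10, 500     # assumed cost per AI read / per follow-up workup

    cancers = population * prevalence
    tp = cancers * sensitivity                       # cancers flagged and followed up
    fp = (population - cancers) * (1 - specificity)  # healthy people flagged and followed up

    total_cost = population * ai_read_cost + (tp + fp) * followup_cost
    print(f"cost per detected cancer: ${total_cost / tp:,.0f}")  # ~$10,464 under these assumptions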
> This will increase false negatives but those are 'free' as they don't require follow ups.
An increase in the false negative rate significantly reduces survival rates and increases the cost of treatment. We have a huge multiplication factor here, so decreasing the false negative rate is the net-positive option at relatively low rates.
> Depending on how the costs of AI detection vs. a doctor compare, that genuinely might be enough to shift the math and be a net positive.
Based on my very superficial medical understanding, screening is already the cheap part. But every false positive would lead to a doctor follow-up at best and a biopsy at worst. Not to mention the significant psychological effects this has on a patient.
So I would counter that the potential increase in false-positive MRI scans could be enough to tip the scale and make screening less useful.